User Tutorial

SciNet HPC Consortium Compute Canada July 9, 2012


Contents

1 Introduction
1.1 General Purpose Cluster (GPC)
1.2 Tightly Coupled System (TCS)
1.3 Accelerator Research Cluster (ARC)
1.4 Power 7 System (P7)
1.5 Storage space and data management
1.6 Acknowledging SciNet
2 Using the GPC
2.1 Login
2.2 Software modules
2.3 Compiling
2.4 Testing and debugging
2.5 Running jobs through the Moab queuing system
2.5.1 Example job script: OpenMP job (without hyperthreading)
2.5.2 Example job script: MPI job on two nodes
2.5.3 Example job script: hybrid MPI/OpenMP job
2.5.4 Example job script: bunch of serial runs
3 Using the Other SciNet Systems
3.1 Login
3.2 Software modules
3.3 Compiling
3.4 Testing and debugging
3.5 Running your jobs
3.5.1 Example job script: OpenMP job on the TCS
3.5.2 Example job script: MPI job on the TCS
3.5.3 Example job script: hybrid MPI/OpenMP job on the TCS
3.5.4 Example job script: GPU job on the ARC
A Brief Introduction to the Unix Command Line
B GPC Quick Start Guide
1 Introduction
SciNet is a consortium for High-Performance Computing made up of researchers at the University of Toronto
and its associated hospitals. It is part of Compute/Calcul Canada, as one of seven consortia in Canada provid-
ing HPC resources to their own academic researchers, other users in Canada and international collaborators.
SciNet runs a unix-type environment. Users not familiar with such an environment should read appendix A.
Any qualified researcher at a Canadian university can get a SciNet account through this two-step process:
Register for a Compute Canada Database (CCDB) account at ccdb.computecanada.org
Non-faculty need a sponsor (the supervisor's CCRI number), who has to have a SciNet account already.
Login and apply for a SciNet account (click Apply beside SciNet on the Consortium Accounts page)
For groups who need more than the default amount of resources, the PI must apply for it through the
competitively awarded account allocation process once a year, in the fall. Without such an allocation, a user
may still use up to 32 nodes (256 cores) of the General Purpose Cluster at a time at low priority.
The SciNet wiki, wiki.scinethpc.ca, contains a wealth of information about using SciNet's systems and
announcements for users, and also shows the current system status. Other forms of support:
Users are kept up-to-date with development and changes on the SciNet systems through a monthly email.
Monthly SciNet User Group lunch meetings including one or more TechTalks.
Classes, such as the Intro to SciNet, 1-day courses on parallel I/O, (parallel) programming, and a full-
term graduate course on scientific computing. The courses' web site is support.scinet.utoronto.ca/courses
Past lecture slides and some videos can be found on the Tutorials and Manuals page.
Usage reports are available on the SciNet portal portal.scinet.utoronto.ca.
If you have problems, questions, or requests, and you couldn't find the answer in the wiki, send an e-mail
to support@scinet.utoronto.ca. The SciNet team can help you with a wide range of problems, such as more
efficient setup of your runs, parallelizing or optimizing your code, and using and installing libraries.
SciNet currently has two main production clusters available for users, and two smaller ones:
1.1 General Purpose Cluster (GPC)
3864 nodes with 8 cores each (two 2.66GHz quad-core Intel Xeon 5500-series processors)
HyperThreading lets you run 16 threads per node efficiently.
16GB RAM per node
Running CentOS 6.2 (a linux distribution derived from Red Hat).
Interconnected by InfiniBand: non-blocking DDR on 1/4 of the nodes, 5:1 blocking QDR on the rest
328 TFlops: #16 on the June 2009 TOP500 list of supercomputer sites (#1 in Canada)
Moab/Torque schedules by node with a maximum wall clock time of 48 hours.
1.2 Tightly Coupled System (TCS)
104 nodes with 32 cores (16 dual-core 4.7GHz POWER6 processors).
Simultaneous MultiThreading allows two tasks to be very efficiently bound to each core.
128GB RAM per node
Running the AIX 5.3L operating system.
Interconnected by full non-blocking DDR InfiniBand
62 TFlops: #80 on the June 2009 TOP500 list of supercomputer sites
Moab/LoadLeveler schedules by node. The maximum wall clock time is 48 hours.
Access to this highly specialized machine is not enabled by default. For access, email us explaining the
nature of your work. Your application should scale well to 64 processes/threads to run on this system.
1.3 Accelerator Research Cluster (ARC)
Eight GPU devel nodes and four NVIDIA Tesla M2070. Per node:
2 quad-core Intel Xeon X5550 2.67GHz
48 GB RAM
2 GPUs with CUDA capability 2.0 (Fermi) each with 448 CUDA cores @ 1.15GHz and 6 GB of RAM.
Interconnected by DDR InfiniBand
16.48 TFlops from the GPUs in single precision
8.24 TFlops from the GPUs in double precision
Running CentOS 6.0 linux.
Moab/Torque schedules jobs similarly as on the GPC, with a maximum wall clock time of 48 hours.
Access disabled by default. Email us if you want access.
1.4 Power 7 System (P7)
Five nodes with four 8-core 3.3 GHz Power7 processors
128 GB RAM per node
Simultaneous MultiThreading allows four tasks to be very efficiently bound to each core.
DDR InfiniBand interconnect
4.2 TFlops theoretical.
Running Red Hat Enterprise Linux 6.0.
LoadLeveler schedules by node. The maximum wall clock time is 48 hours.
Accessible to TCS users
1.5 Storage space and data management
1790 1TB SATA disk drives, for a total of 1.4 PB of storage
Two DCS9900 couplets, each delivering 4-5GB/s read/write access to the drives
Single GPFS file system on the TCS, GPC and ARC.
I/O shares the same InfiniBand network used for parallel jobs on the clusters.
HPSS: a tape-backed storage expansion solution, only for users with a substantial storage allocation.
The storage at SciNet is divided over different file systems:
file system          quota              block size   time limit   backup   devel        comp
/home                10GB               256kB        indefinite   yes      read/write   read-only
/scratch             20TB or 1M files   4MB          3 months     no       read/write   read/write
/project             by allocation      4MB          indefinite   yes      read/write   read/write
/archive (on HPSS)   by allocation      -            indefinite   no       on HPSS only
Of these four, /home and /scratch are the two most important ones, while access to the latter two is
restricted to users with an appropriate storage allocation.
Every SciNet user gets a 10GB directory on /home (called /home/G/GROUP/USER, where GROUP is your
group name, G is the first (lower case) letter of the group name and USER is your user name). For example,
user rzon in the group scinet has his user directory under /home/s/scinet/rzon. The home directory is
regularly backed up. Do not keep many small files on the system. They waste quite a bit of space. On home,
with a block size of 256kB, you can have at most 40960 files no matter how small they are, so you would
run out of disk quota quite rapidly with many small files.
On the compute nodes of the GPC /home is mounted read-only; thus GPC jobs can read files in /home
but cannot write to files there. /home is a good place to put code, input files for runs, and anything
else that needs to be kept to reproduce runs. In addition, every SciNet user gets a directory in /scratch
(/scratch/G/GROUP/USER). Note: the environment variables $HOME and $SCRATCH contain the location
of your home and scratch directory, respectively.
In $SCRATCH, up to 20TB could be stored, although there is not enough room for each user to do this! In
addition, there is a limit of one million files for $SCRATCH. Because $HOME is read-only on compute nodes,
$SCRATCH is where jobs would normally write their output. Note that there are NO backups of scratch.
Furthermore, scratch is purged routinely. The current policy is that files which have not been accessed for
over three months will be deleted, with users getting a two-week notice on what files are to be deleted.
File transfers to and from SciNet
To transfer less than 10GB to or from SciNet, you can use the login nodes. The login nodes are visible from outside
SciNet, which means that you can transfer data to and from your own machine to SciNet using scp or rsync
starting from SciNet or from your own machine. The login nodes have a cpu time limit of 5 minutes, which
means that even if you tried to transfer more than 10GB, you would probably not succeed.
Large transfers of data (more than 10GB) to or from SciNet are best done from the datamover1 or
datamover2 node. From any of the interactive SciNet nodes, one can ssh to datamover1 or datamover2. These
machines have the fastest network connection to the outside world (by a factor of 10; a 10Gb/s link vs.
1Gb/s). Datamover2 is sometimes under heavy load for sysadmin purposes, but datamover1 is for user
traffic only.
Transfers must be originated from the datamovers; that is, one cannot copy files from the outside world
directly to or from a datamover node; one has to log in to that datamover node and copy the data to or from
the outside network. Your local machine must be reachable from the outside, either by its name or its IP
address. If you are behind a firewall or a (wireless) router, this may not be possible. You may need to ask
your system administrator to allow datamover to ssh to your machine.
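For example, a minimal hedged sketch of a large transfer initiated from datamover1 (the remote host name,
user name and paths are placeholders, not real SciNet examples):
$ ssh datamover1
$ rsync -av $SCRATCH/myrun/ myuser@myhost.example.ca:/data/myrun/     # push results out
$ scp myuser@myhost.example.ca:/data/input.tar.gz $SCRATCH/myrun/     # pull input data in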
I/O on SciNet systems
The compute nodes do not contain hard drives, so there is no local disk available to use during your
computation. The available disk space, i.e., the home and scratch directories, is all part of the GPFS file system
which runs over the network. GPFS is a high-performance file system which provides rapid reads and writes
to large data sets in parallel from many nodes. As a consequence of this design, however, it performs quite
poorly at accessing data sets which consist of many small files. Furthermore, the file system is a shared
resource. Creating many small files or opening and closing files with small reads, and similar inefficient I/O
practices, hurt your jobs' performance, and are felt by other users too.
Because of this file system setup, you may well find that you have to reconsider the I/O strategy of your
program. The following points are very important to bear in mind when designing your I/O strategy:
Do not read and write lots of small amounts of data to disk. Reading data in from one 4MB file can be
enormously faster than from 100 40KB files.
Unless you have very little output, make sure to write your data in binary.
Having each process in an MPI run write to a file of its own is not a scalable I/O solution. A directory gets
locked by the first process accessing it, so the other processes have to wait for it. Not only has the code
just become considerably less parallel, chances are the file system will have a time-out while waiting for
your other processes, leading your program to crash mysteriously. Consider using MPI-IO (part of the
MPI-2 standard), NetCDF or HDF5, which allow files to be opened simultaneously by different processes.
You could also use a dedicated process for I/O to which all other processes send their data, and which
subsequently writes this data to a single file.
If you must read and write a lot to disk, consider using the ramdisk. On the GPC, you can use up to
11GB of a compute node's ram like a local disk. This will reduce how much memory is available for your
program. The ramdisk can be accessed using /dev/shm. Anything written to this location that you want
to preserve must be copied back to the scratch file system as /dev/shm is wiped after each job.
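A minimal hedged sketch of the copy-in/compute/copy-out pattern inside a job script (the directory and
program names are hypothetical):
mkdir -p /dev/shm/$USER/myrun && cd /dev/shm/$USER/myrun
cp $SCRATCH/myrun/input.dat .                    # stage input into the ramdisk
$SCRATCH/myrun/my_program input.dat output.dat   # do the heavy I/O locally in /dev/shm
cp output.dat $SCRATCH/myrun/                    # copy results back before the job ends; /dev/shm is wiped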
1.6 Acknowledging SciNet
In publications based on results from SciNet computations, please use the following acknowledgment:
Computations were performed on the <systemname> supercomputer at the SciNet HPC Consor-
tium. SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute
Canada; the Government of Ontario; Ontario Research Fund - Research Excellence; and the Uni-
versity of Toronto.
where you replace <systemname> by GPC or TCS. Also please cite the SciNet datacentre paper:
Chris Loken et al., SciNet: Lessons Learned from Building a Power-efficient Top-20 System and
Data Centre, J. Phys.: Conf. Ser. 256, 012026 (2010).
In any talks you give, please feel free to use the SciNet logo, and images of GPC, TCS, and the data centre.
These can be found on the wiki page Acknowledging SciNet .
We are very interested in keeping track of SciNet-powered publications! We track these for our own interest,
but such publications are also useful evidence of scientific merit for future resource allocations. Please
email details of any such publications, along with PDF preprints, to support@scinet.utoronto.ca.
2 Using the GPC
Using SciNet's resources is significantly different from using a desktop machine. The rest of this document
will guide you through the process of using the GPC first (as most users will make use of that system), while
details of the other systems are given afterwards.
A reference sheet for the GPC can be found in Appendix B at the end of this document.
Computing on SciNet's clusters is done through a batch system. In its simplest form, it is a four stage process:
1. Login with ssh to the login nodes and transfer files.
These login nodes are gateways; you do not run or compile on them!
2. Ssh to one of the development nodes gpc01-04, where you load modules, compile your code and write
a script for the batch job.
3. Move the script, input data, etc. to the scratch disk, as you cannot write to your home directory from the
compute nodes. Submit the job to a queuing system.
4. After the scheduler has run the job on the compute nodes (this can take some time), and the job is
completed, deal with the output of the run.
2.1 Login
Access to the SciNet systems is via secure shell (ssh) only. Ssh to the gateway login.scinet.utoronto.ca:
$ ssh -X -l <username> login.scinet.utoronto.ca
The -X flag is there to set up X forwarding so that you can run X windows applications, such as graphical
editors and debuggers. If the -X flag does not work for you, try the less secure -Y flag.
The login nodes are a front end to the data centre, and are not part of the GPC. For anything but small file
transfer and viewing your files, the next step is to ssh in to one of the devel nodes gpc01,...,gpc04, e.g.
$ ssh -X gpc03
These devel nodes have the same architecture as the compute nodes, but with more memory.
The SciNet firewall monitors for too many connections, and will shut down access (including previously
established connections) from your IP address if more than four connection attempts are made within a few
minutes. In that case, you will be locked out of the system for an hour. Be patient in attempting new logins!
More about ssh and logging in from Windows can be found on the wiki page Ssh .
2.2 Software modules
Most software and libraries on the GPC have to be loaded using the module command. This allows us to
keep multiple versions for different users, and it allows users to easily switch between versions. The module
system sets up environment variables (PATH, LD_LIBRARY_PATH, etc.).
Basic usage of the module command is as follows:
module load <module-name>           to use particular software
module unload <module-name>         to stop using particular software
module switch <module1> <module2>   to unload module1 and load module2
module purge                        to remove all currently loaded modules
module avail                        to list available software packages (+ all versions)
module list                         to list currently loaded modules in your shell
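As a minimal hedged example of a typical session on a devel node (the exact versions you get will differ over
time):
$ module load intel
$ module load openmpi
$ module list
$ module switch openmpi intelmpi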
You can load frequently used modules in the file .bashrc in your home directory, but be aware that this file
is read by any script that you run (including job scripts and the mpi wrapper scripts mpicc, mpif90, etc.).
Many modules are available in several versions (e.g. intel/12 and intel/12.1.3). When you load a module
with its short name (the part before the slash /, e.g., intel), you get the most recent and recommended
version of that library or piece of software. In general, you probably want to use the short module name,
especially since we may upgrade to a new version and deprecate the old one. By using the short module
name, you ensure that your existing module load commands still work. However, for reproducibility of your
runs, record the full names of loaded modules.
Library modules define the environment variables pointing to the location of library files, include files and
the base directory for use in Makefiles. The names of the library, include and base variables are as follows:
SCINET_[shortmodulename]_LIB
SCINET_[shortmodulename]_INC
SCINET_[shortmodulename]_BASE
That means that to compile code that uses that package you add the following flags to the compile command:
-I${SCINET_[shortmodulename]_INC}
while to the link command, you have to add
-L${SCINET_[shortmodulename]_LIB}
before the necessary link flags (-l...).
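For instance, a hedged sketch for a code that uses the gsl module (the upper-case variable names follow the
pattern above but are an assumption for this particular module; check module show gsl for the real names):
$ module load intel gsl
$ icc -O3 -xhost -I${SCINET_GSL_INC} gsl_example.c -L${SCINET_GSL_LIB} -lgsl -lgslcblas -o gsl_example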
On July 9, 2012, the module list for the GPC contained:
intel, gcc, intelmpi, openmpi, nano, emacs, xemacs, autoconf, cmake, git, scons,
svn, ddt, ddd, gdb, mpe, openspeedshop, scalasca, valgrind, padb, grace, gnuplot,
vmd, ferret, ncl, ROOT, paraview, pgplot, ImageMagick, netcdf, parallel-netcdf,
ncview, nco, udunits, hdf4, hdf5, encfs, gamess, nwchem, gromacs, cpmd,
blast, amber, gdal, meep, mpb, R, petsc, boost, gsl, fftw, intel, extras, clog,
gnu-parallel, guile, java, python, ruby, octave, gotoblas, erlang, antlr, ndiff, nedit,
automake, cdo, upc, inteltools, cmor, ipm, cxxlibraries, Xlibraries, dcap, xml2, yt
Mathematical libraries supporting things like BLAS and FFT are part of modules as well: the Intel Math
Kernel Library (MKL) is part of the intel module and the goto-blas modules, and there are separate fftw
modules (although MKL supports this as well).
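As a hedged illustration, code calling BLAS/LAPACK through MKL can typically be linked with the Intel
compilers' -mkl convenience flag (assuming the intel module is loaded; the file name is a placeholder):
$ icc -O3 -xhost blas_example.c -o blas_example -mkl=sequential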
Other commercial packages (MatLab, Gaussian, IDL,...) are not available for licensing reasons. But
Octave, a highly MatLab-compatible open source alternative, is available as a module.
A current list of available software is maintained on the wiki page Software and Libraries .
2.3 Compiling
The GPC has compilers for C, C++, Fortran (up to 2003 with some 2008 features), Co-array Fortran, and
Java. We will focus here on the most commonly used languages: C, C++, and Fortran.
It is recommended that you compile with the Intel compilers, which are icc, icpc, and ifort for C, C++,
and Fortran. These compilers are available with the module intel (i.e., put module load intel in your
.bashrc). If you really need the GNU compilers, recent versions of the GNU compiler collection are available
as modules, with gcc, g++, gfortran for C, C++, and Fortran. The old g77 is not supported, but both ifort
and gfortran are able to compile Fortran 77 code.
Optimize your code for the GPC machine using at least the following compilation flags -O3 -xhost, e.g.
$ ifort -O3 -xhost example.f -o example
$ icc -O3 -xhost example.c -o example
$ icpc -O3 -xhost example.cpp -o example
(the equivalent flags for GNU compilers are -O3 -march=native).
Compiling OpenMP code
To compile programs using shared memory parallel programming using OpenMP, add -openmp to the com-
pilation and linking commands, e.g.
$ ifort -openmp -O3 -xhost omp_example.f -o omp_example
$ icc -openmp -O3 -xhost omp_example.c -o omp_example
$ icpc -openmp -O3 -xhost omp_example.cpp -o omp_example
Compiling MPI code
Currently, the GPC has following MPI implementations installed:
1. Open MPI, in module openmpi (default version: 1.4.4)
2. Intel MPI, in module intelmpi (default version: 4.0.3)
You can choose which one to use with the module system, but we recommend sticking with Open MPI
unless you have a good reason not to. Switching between MPI implementations is not always obvious.
Once an mpi module is loaded, MPI code can be compiled using mpif77/mpif90/mpicc/mpicxx, e.g.,
$ mpif77 -O3 -xhost mpi_example.f -o mpi_example
$ mpif90 -O3 -xhost mpi_example.f90 -o mpi_example
$ mpicc -O3 -xhost mpi_example.c -o mpi_example
$ mpicxx -O3 -xhost mpi_example.cpp -o mpi_example
These commands are wrapper (bash) scripts around the compilers which include the appropriate flags to
use the MPI libraries. Because these are bash scripts, it is advisable not to load the mpi module of your choice
in your .bashrc if you intend to try out different mpi implementations or versions. All examples below will,
however, assume that the .bashrc loads the intel module.
Hybrid MPI/OpenMP applications are compiled with the same commands, but with the -openmp flag added, e.g.
$ mpif77 -openmp -O3 -xhost hybrid_example.f -o hybrid_example
$ mpif90 -openmp -O3 -xhost hybrid_example.f90 -o hybrid_example
$ mpicc -openmp -O3 -xhost hybrid_example.c -o hybrid_example
$ mpicxx -openmp -O3 -xhost hybrid_example.cpp -o hybrid_example
For hybrid OpenMP/MPI code using Intel MPI, add the compilation flag -mt_mpi for full thread-safety.
2.4 Testing and debugging
Apart from compilation, the devel nodes may also be used for short, small scale test runs (on the order of
a few minutes), although there is also a specialized queue for that (see the next section). It is important to test
your job's requirements and scaling behaviour before submitting large scale computations to the queuing
system. Because the devel nodes are used by everyone who needs to use the GPC, be considerate.
To run a short test of a serial (i.e., non-parallel) program, simply type from a devel node
$ ./<executable> [arguments]
Serial production jobs must be bunched together to use all 8 cores; see below.
To run a short 4-thread OpenMP run on the GPC, type
$ OMP_NUM_THREADS=4 ./<executable> [arguments]
To run a short 4-process MPI run on a single node, type
$ mpirun -np 4 ./<executable> [arguments]
For debugging, we highly recommend DDT, Allinea's graphical parallel debugger. It is available in the module
ddt. DDT can handle serial, openmp, mpi, as well as gpu code (useful on the ARC system). To enable
debugging in your code, you have to compile it with the flag -g, and you probably want to dial down the
optimization level to -O1 or even -O0 (no optimization). After loading the ddt module, simply start ddt with
ddt, and follow the graphical interface's questions.
Larger, multi-node debugging should be done on the dedicated debug nodes which can be accessed
through the debug queue. The queuing system will be explained in more detail below. As a preliminary
example, to start a thirty-minute ddt debugging session on three nodes for a 24 process mpi program,
you can do the following from a gpc devel node:
$ qsub -l nodes=3:ppn=8,walltime=30:00 -X -I -q debug
qsub: waiting for job <jobid>.gpc-sched to start
...wait until you get a prompt...
qsub: job <jobid>.gpc-sched ready
--- --- --- --- --- --- --- --- --- ---
Begin PBS Prologue <datestamp>
Job ID: <jobid>.gpc-sched
Username: <username>
Group: <groupname>
Nodes: <node1> <node2> <node3>
End PBS Prologue <datestamp>
--- --- --- --- --- --- --- --- --- ---
$ module load <your-libraries>
$ module load Xlibraries
$ module load ddt
$ ddt
...follow the ddt menus on screen...
Note: the GNU debugger (gdb), the Intel debugger (idbc/idb) and a graphical debugger called ddd are
available on the GPC as well.
2.5 Running jobs through the Moab queuing system
To run a job on the compute nodes, it must be submitted to a queue. The queuing system used on the GPC
is based around the Moab Workload Manager, with Torque (PBS) as the back-end resource manager. The
queuing system will send the jobs to the compute nodes. It schedules by node, so you cannot request e.g. a
two-core job. It is the user's responsibility to make sure that the node is used efficiently, i.e., all cores on a
node are kept busy.
Job submission starts with a script that specifies what executable to run, from which directory to run it, on
how many nodes, with how many threads, and for how long. A job script can be submitted to a queue with
$ qsub script.sh
This submits the job to the batch queue and assigns the job a jobid. There are two other queues on the
GPC, largemem and debug, whose use will be explained below.
Once the job is incorporated into the queue (which can take a minute), you can use:
$ showq
to show all jobs in the queue. To just see your jobs, type
$ showq -u <username>
and to see only your running jobs, you can use
$ showq -r -u <username>
There are also job-specic commands such as showstart <jobid>, checkjob <jobid>, canceljob <jobid>,
to estimate when a job will start, to check the status of your job, and to cancel a job (we recommend not
using the torque commands qdel, etc.).
Job scripts serve a dual purpose:
1. they specify what resources your job needs, and for how long;
2. and they contain the command to be executed (on the first node).
The first purpose is accomplished without interfering with the second by using special command lines starting
with #PBS, typically at the top of the script. After the #PBS, one can specify resource options. Only one is
mandatory:
-l: specifies requested nodes and time, e.g.
-l nodes=1:ppn=8,walltime=1:00:00
-l nodes=2:ppn=8,walltime=1:00:00
The :ppn=8 part is mandatory as well, since scheduling goes by 8-core node.
To make your jobs start faster, reduce the requested time (walltime) to be closer to the estimated run
time (perhaps adding about 10 percent to be sure). Shorter jobs are scheduled sooner than longer ones.
Other resource options are
-N gives your job a name (so it is easily identified in the queue).
-q: specifies the queue, e.g.
-q largemem
-q debug
This is not necessary for the regular batch queue, which is the default.
-o: change the default name of the file to contain any output to standard output from your job.
-e: change the default name of the file to contain any output to standard error from your job.
Note: the output and error files are not available until the job is finished.
After the resource options, one specifies the commands to be run, just as in a regular shell script. Your default
shell used for the command line as well as for scripts is bash, but it is possible to use csh as well for job
scripts. The shell that should execute a script is specified in the very first line of the script, and should read
either #!/bin/bash or #!/bin/csh.
Runs are to be executed from $SCRATCH, because your $HOME is read-only on the compute nodes. The
easiest way to ensure that your run starts from a directory that you have write access to, is to copy
all the files necessary for your run to a work directory under $SCRATCH, to invoke the qsub command from
that directory, and to have as the first line of your job script:
cd $PBS_O_WORKDIR
If you don't, the scheduler will start your job in $HOME. The environment variable PBS_O_WORKDIR is set by
the scheduler to the submission directory.
Here is a simple example of a job script for an openmp application to run with 16 threads:
#!/bin/bash
#PBS -l nodes=1:ppn=8
#PBS -l walltime=1:00:00
#PBS -N simple-openmp-job
cd $PBS_O_WORKDIR
./openmp_example
The reason this job uses 16 threads although the node has 8 cores is that the cpus on the GPC nodes have
HyperThreading enabled. HyperThreading allows efficient switching between tasks, and makes it seem to
the operating system like there are 16 logical cpus rather than 8 on each node. Using this requires no changes
to the code, only running 16 rather than 8 tasks on the node. By default, OpenMP applications will use all
logical cpus, i.e., 16 threads, unless the environment variable OMP_NUM_THREADS is set to a different value.
E.g., to disable HyperThreading for OpenMP applications, use export OMP_NUM_THREADS=8. In contrast, MPI
applications always need to be told explicitly how many processes to use. Thus, to use HyperThreading for a
single node mpi job, it is enough to set -np 16 instead of -np 8 (hybrid mpi/openmp applications require a
bit more care, see example 2.5.3 below).
Monitoring jobs
Once submitted, checkjob <jobid> can give you information on a job (after a short delay). It will tell you
if your job is running, idle, or blocked.
Blocked jobs are usually just jobs that the scheduler will not consider yet because the user or the group
that the user belongs to has reached their limit on the number of jobs or the number of cores to be used
simultaneously. Once running jobs from that user or group finish, these jobs will be moved from the blocked
list to the idle list.
Idle in this context means that your job will be considered to run by the scheduler, based on a mechanism
called fair-share. In this mechanism, the main factors for when a job will run are the priority of your group
(based on the group's allocation and previous usage) and how long the job has been in the idle queue. In
addition, the scheduler performs backfilling, which means that if, to schedule a large multinode job, it has
to keep some nodes unused for a while, it will put jobs lower on the idle list on these reserved nodes if they fit
within that time and node slot. This may seem like jumping the queue, but it maximizes utilization of the
GPC without delaying the start time of other queued jobs. Keep in mind, however, that start times of jobs
in the queue can change if a user with a higher priority submits a job. Barring that, an estimate of the start
time of an idle job can be obtained using
$ showstart <jobid>
Once the job is running, you can still use checkjob, but you can also ssh to any of the nodes on which it is
running (listed by checkjob <jobid>). This allows you to monitor the progress and behaviour of your job
in more detail, using e.g. the top command. It also allows you to check the output and error messages from
your job on the fly. They are located in the directory /var/spool/torque/spool, which exists only on the
(first) node assigned to your job.
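A hedged sketch of this kind of monitoring (the job id and node name are placeholders; the exact names of
the files in the spool directory may differ):
$ checkjob <jobid>               # lists, among other things, the nodes the job is running on
$ ssh <node>                     # log in to the first node listed for your job
$ top                            # check cpu and memory usage of your processes
$ ls /var/spool/torque/spool     # the job's standard output/error files live here while it runs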
After a job has finished, these files with standard output and error are copied to the submission directory. By
default the files are named <jobname>.o<jobid> and <jobname>.e<jobid>, respectively. In addition to any
output that your application writes to standard output, the .o file contains PBS information such as how long
your job took and how much memory it required. The .e file contains any error messages. If something goes
wrong with a job of yours, inspect these files carefully for hints on what went wrong.
Large memory jobs
There are 84 GPC nodes with 32 GB of memory instead of 16GB, which are of the same architecture as the
regular GPC compute and devel nodes. To request these, you use the regular batch queue, but add the flag
:m32g to the node request, i.e.
#PBS -q batch
#PBS -l nodes=1:ppn=8:m32g
In addition, there are two GPC nodes with 128GB ram and 16 cores, intended for one-off data analysis or
visualization that may require such resources. Because these have a different architecture than the rest of
the GPC nodes, they have their own queue, and you have to compile code for these nodes on one of the GPC
devel nodes without the -xHost compilation flag.
To request a large memory node for a job, specify
#PBS -q largemem
#PBS -l nodes=1:ppn=16
in your job script (although ppn=8 will be accepted too).
Interactive jobs
Most PBS parameters in #PBS lines can also be given as parameters to qsub, but it is advisable to keep them
in the job script so you have a record of how you submitted your job. The exception is when you request an
interactive job, which is accomplished by giving the -I flag to qsub, e.g.
$ qsub -l nodes=1:ppn=8,walltime=1:00:00 -X -q debug -I
In addition to an interactive job, this requests the debug queue instead of the regular batch queue. This gives
access to a small number of reserved debug nodes. These nodes are the same as the usual compute nodes, but
jobs have a higher turnover in the debug queue than in the regular queue, i.e., they start sooner, because
only short jobs are allowed. The debug queue is ideal for short multinode tests and for debugging. Finally,
the -X flag in the above qsub command requests that X is forwarded, which is essential if you are going to
use ddt to debug. Note that this will only work if you gave the -X (or -Y) flag to each ssh command (i.e., at
a minimum from your machine to login.scinet.utoronto.ca, and from the login node to gpc01..gpc04).
The interactive flag -I works in combination with the largemem and batch queues as well. However, its use
with the batch queue is discouraged, as it may take a very long time for you to get a prompt.
Queue limits
queue      min. time   max. time   max jobs                max cores
batch      15m         48h         32, 1000 w/allocation   256, 8000 w/allocation
debug      -           2h/30m      1                       walltime dependent, between 16 and 64
largemem   15m         48h         1                       16 (32 threads)
Serial jobs on the GPC
SciNet is a parallel computing resource, and our priority will always be parallel jobs. Having said that, if you
can make efficient use of the resources using serial jobs and get good science done, that's acceptable too.
There is however no queue for serial jobs, so if you have serial jobs, you will have to bunch them together to
use the full power of a node (Moab schedules by node).
The GPC nodes each have 8 processing cores, and making efficient use of these nodes means using all eight
cores. As a result, we'd like users to run multiples of 8 jobs at a time. The easiest way to do this is to bunch
the jobs in groups of 8 that will take roughly the same amount of time.
It is important to group the programs by how long they will take. If one job takes 2 hours and the rest
running on the same node only take 1, then for one hour 7 of the 8 cores on the GPC node are wasted; they
are sitting idle but are unavailable for other users, and the utilization of this node is only 56 percent.
You should have a reasonable idea of how much memory the jobs require. The GPC compute nodes have
about 14GB in total available to user jobs running on the 8 cores. So the jobs have to be bunched in ways
that will fit into 14GB. If that's not possible, one could in principle run fewer jobs so that they do fit.
Another highly recommended method is using GNU parallel, which can do the load balancing for you. See
the wiki page User Serial .
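A minimal hedged sketch of such a load-balanced serial job, using the gnu-parallel module listed above (the
file joblist.txt is a placeholder containing one command per line, one line per serial run):
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=6:00:00
#PBS -N gnu-parallel-test
cd $PBS_O_WORKDIR
module load gnu-parallel
# run the commands in joblist.txt, keeping 8 of them going at a time until all are done
parallel -j 8 < joblist.txt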
2.5.1 Example job script: OpenMP job (without hyperthreading)
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=6:00:00
#PBS -N openmp-test
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
./openmp_example
2.5.2 Example job script: MPI job on two nodes
#!/bin/bash
#PBS -l nodes=2:ppn=8,walltime=6:00:00
#PBS -N mpi-test
cd $PBS_O_WORKDIR
module load openmpi
mpirun -np 16 ./mpi_example
When using hyperthreading, add the parameter --mca mpi_yield_when_idle 1 to mpirun.
For MPI code using Intel MPI with hyperthreading, add -genv I_MPI_SPIN_COUNT 1 to mpirun.
2.5.3 Example job script: hybrid MPI/OpenMP job
#!/bin/bash
#PBS -l nodes=3:ppn=8,walltime=6:00:00
#PBS -N hybrid-test
cd $PBS_O_WORKDIR
module load openmpi
export OMP_NUM_THREADS=4
mpirun --bynode -np 6 ./hybrid_example
The --bynode option is essential; without it, MPI processes bunch together in eights on each node.
For Intel MPI, that option needs to be replaced by -ppn 2.
In addition, for hybrid OpenMP/MPI code using Intel MPI, make sure you have
export I_MPI_PIN_DOMAIN=omp in .bashrc or in the job script.
2.5.4 Example job script: bunch of serial runs
#!/bin/bash
#PBS -l nodes=1:ppn=8,walltime=6:00:00
#PBS -N serialx8-test
cd $PBS_O_WORKDIR
(cd jobdir1; ./dojob1) &
(cd jobdir2; ./dojob2) &
(cd jobdir3; ./dojob3) &
(cd jobdir4; ./dojob4) &
(cd jobdir5; ./dojob5) &
(cd jobdir6; ./dojob6) &
(cd jobdir7; ./dojob7) &
(cd jobdir8; ./dojob8) &
wait # Without this, the job will terminate immediately, killing the 8 runs you just started
3 Using the Other SciNet Systems
3.1 Login
The login nodes are a front end to the data centre, and are not part of any compute cluster. For
anything but small file transfer and viewing your files, you next log in to the TCS, ARC or P7 through their
development nodes (tcs01 or tcs02 for the TCS, arc01 for the ARC, and p701 for the P7, respectively).
3.2 Software modules
As on the GPC, most software and libraries have to be loaded using the module command.
Frequently used modules may be loaded in the .bashrc in your $HOME, but make sure that it distinguishes
between the different clusters, as they all use the same .bashrc (see Important .bashrc guidelines ).
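A minimal hedged sketch of one way to make .bashrc cluster-aware, branching on the host name (the
hostname patterns are assumptions for illustration; see the wiki page for the recommended approach):
case $(hostname -s) in
  gpc*) module load intel openmpi ;;   # GPC devel and compute nodes
  arc*) module load intel cuda ;;      # ARC nodes
  tcs*) : ;;                           # TCS: the IBM compilers are available by default
esac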
On July 9, 2012, the module list for the ARC contained:
intel, gcc, intelmpi, openmpi, cuda, nano, emacs, xemacs, autoconf, cmake, git, scons,
svn, ddt, ddd, gdb, mpe, openspeedshop, scalasca, valgrind, padb, grace, gnuplot,
vmd, ferret, ncl, ROOT, paraview, pgplot, ImageMagick,netcdf, parallel-netcdf,
ncview, nco, udunits, hdf4, hdf5, encfs, gamess, nwchem, gromacs, cpmd,
blast, amber, gdal, meep, mpb, R, petsc, boost, gsl, fftw, intel, extras, clog,
gnu-parallel, guile, java, python, ruby, octave, gotoblas, erlang, antlr, ndiff, nedit,
automake, cdo, upc, inteltools, cmor, ipm, cxxlibraries, Xlibraries, dcap, xml2, yt
The module list for the TCS contains:
upc, xlf, vacpp, mpe, scalasca, hdf4, hdf5, extras, netcdf, parallel-netcdf, nco, gsl, antlr,
ncl, ddt, fftw, ipm
The IBM compilers are available by default on the TCS and do not require a module to be loaded, although
newer versions may be installed as modules.
Math software supporting things like BLAS and FFT is either available by default or part of a module: on
the ARC (as on the GPC), there is the Intel Math Kernel Library (MKL), which is part of the intel module,
and the goto-blas modules, while on the TCS, IBM's ESSL high performance math library is available by
default.
A current list of available software is maintained on the wiki page Software and Libraries .
3.3 Compiling
TCS compilers
The TCS has compilers for C, C++, Fortran (up to 2003), UPC, and Java. We will focus here on the most
commonly used languages: C, C++, and Fortran.
The compilers are xlc,xlC,xlf for C, C++, and Fortran compilations. For OpenMP or other threaded appli-
cations, one has to use re-entrant-safe versions xlc_r,xlC_r,xlf_r. For MPI applications, mpcc,mpCC,mpxlf
are the appropriate wrappers. Hybrid MPI/OpenMP applications require mpcc_r,mpCC_r,mpxlf_r.
We strongly suggest the compilation flags
-O3 -q64 -qhot -qarch=pwr6 -qtune=pwr6
For OpenMP programs, you should add
-qsmp=omp
In the link command, we suggest using
-q64 -bdatapsize:64k -bstackpsize:64k
supplemented by
-qsmp=omp
for OpenMP programs.
To use the full C++ bindings of MPI (those in the MPI namespace) with the IBM C++ compilers, add -cpp
to the compilation line. If you're linking several C++ object files, add -bh:5 to the link line.
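Putting the suggested flags together, a hedged sketch of building a hybrid MPI/OpenMP Fortran code on the
TCS (the file name is a placeholder):
$ mpxlf_r -O3 -q64 -qhot -qarch=pwr6 -qtune=pwr6 -qsmp=omp -c hybrid_example.f
$ mpxlf_r -q64 -bdatapsize:64k -bstackpsize:64k -qsmp=omp hybrid_example.o -o hybrid_example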
P7 compilers
Compilation for the P7 should be done with the IBM compilers on the devel node p701. The compilers
are xlc,xlC,xlf for C, C++, and Fortran compilations and become accessible by loading the modules
vacpp and xlf, respectively. For OpenMP or other threaded applications, one has to use re-entrant-safe
versions xlc_r,xlC_r,xlf_r. For MPI applications, mpcc,mpCC,mpxlf are the appropriate wrappers. Hybrid
MPI/OpenMP applications require mpcc_r,mpCC_r,mpxlf_r. We suggest the compilation flags
-O3 -q64 -qhot -qarch=pwr7 -qtune=pwr7
For OpenMP programs, add, for both compilation and linking,
-qsmp=omp
ARC compilers
To compile cuda code for runs on the ARC, you log in from login.scinet.utoronto.ca to the devel node:
$ ssh -X arc01
The ARC has the same compilers for C, C++, Fortran as the GPC, and in addition has the PGI and NVIDIA
cuda compilers for GPGPU computing. To use the cuda compilers, you have to load a cuda module. The
current default version of cuda is 4.1, but modules for cuda 3.2, 4.0 and 4.2 are installed as well. The cuda
c/c++ compiler is called nvcc. To optimize your code for the ARC architecture (and access all of its cuda
capabilities), use at least the following compilation flags
-O3 -arch=sm_20
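For instance, a minimal hedged sketch of compiling a CUDA source file on arc01 (the file name is a
placeholder):
$ module load cuda
$ nvcc -O3 -arch=sm_20 gpu_example.cu -o gpu_example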
To use the PGI compiler, you have to
$ module load gcc/4.4.6 pgi
The compilers are pgfortran, pgcc and pgcpp. These compilers support CUDA Fortran and OpenACC using
the -acc -ta=nvidia -Mcuda=4.0 options.
3.4 Testing and debugging
TCS testing and debugging
Short test runs are allowed on the devel nodes if they don't use much memory and only use a few cores.
To run a short 8-thread OpenMP test run on tcs02:
$ OMP_NUM_THREADS=8 ./<executable> [arguments]
To run a short 16-process MPI test run on tcs02:
$ mpiexec -n 16 ./<executable> [arguments] -hostfile <hostfile>
<hostfile> should contain as many copies of the line tcs-f11n05 or tcs-f11n06 (depending on whether
you're on tcs01 or tcs02) as you want processes in the MPI run.
Furthermore, the file .rhosts in your home directory has to contain a line with tcs-f11n06.
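A hedged sketch of creating such a host file for a 16-process test (tcs-f11n06 is used for illustration; use the
name of the devel node you are actually on):
$ yes tcs-f11n06 | head -n 16 > myhostfile
$ mpiexec -n 16 ./<executable> [arguments] -hostfile myhostfile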
The standard debugger on the TCS is called dbx. The DDT debugger is available in the module ddt.
P7 testing and debugging
This works as on the TCS, but with a different host file, with up to 128 lines containing p07n01.
ARC testing and debugging
Short test runs are allowed on the devel node arc01. For GPU applications, simply run the executable. If you
use MPI and/or OpenMP as well, follow the instructions for the GPC. Note that because arc01 is a shared
resource, the GPUs may be busy, and so you may experience a lag in your programs starting up.
The NVIDIA debugger for cuda programs is cuda-gdb. The DDT debugger is available in the module ddt.
3.5 Running your jobs
As for the GPC, to run a job on the compute nodes you must submit it to a queue. You can submit jobs from
the devel nodes in the form of a script that species what executable to run, from which directory to run it,
on how many nodes, with how many threads, and for how long. The queuing system used on the TCS and
P7 is LoadLeveler, while Torque is used on the ARC. The queuing system will send the jobs to the compute
nodes. It schedules by node, so you cannot request e.g. a two-core job. It is the user's responsibility to make
sure that the node is used efficiently, i.e., all cores on a node are kept busy.
Examples of job scripts are given below. You can use these example scripts as starting points for your own.
Note that it is best to run from the scratch directory, because your home directory is read-only on the compute
nodes. Since the scratch directory is not backed up, copy essential results to $HOME after the run.
TCS queue
For the TCS, there is only one queue:
queue      time (hrs)   max jobs   max cores
verylong   48           2/25       64/800 (128/1600 threads)
Submitting is done from tcs01 or tcs02 with
$ llsubmit <script>
and llq shows the queue.
As for the GPC, the job script is a shell script that serves two purposes:
1. It specifies the requirements of the job in special comment lines starting with #@.
2. Once the required nodes are available (i.e., your job made it through the queue), the scheduler runs
the script on the rst node of the set of nodes. To run on multiple nodes, the script has to use poe. It
is also possible to give the command to run as one of the requirement options.
There are a lot of possible settings in a loadleveler script. Instead of writing your own from scratch, it is more
practical to take one of the examples of job scripts given below and adapt it to suit your needs.
The POWER6 processors have a facility called Simultaneous MultiThreading which allows two tasks to
be very efficiently bound to each core. Using this requires no changes to the code, only running 64 rather
than 32 tasks on the node. For OpenMP applications, see if setting OMP_NUM_THREADS and THRDS_PER_TASK
to a number larger than 32 makes your job run faster. For MPI, increase tasks_per_node>32.
Once your job is in the queue, you can use llq to show the queue, and job-specic commands such as
llcancel, llhold, ...
Do not run serial jobs on the TCS! The GPC can do that, of course, in bunches of 8.
To make your jobs start sooner, reduce the wall_clock_limit to be closer to the estimated run time
(perhaps adding about 10% to be sure). Shorter jobs are scheduled sooner than longer ones.
P7 queue
The P7 queue is similar to the P6 queue. For differences, see the wiki page on the P7 Linux Cluster .
ARC queue
There is only one queue for the ARC:
queue   min. time   max. time   max cores   max gpus
batch   15m         48h         32          8
This queue is integrated into the gpc queuing system.
You submit to the queue from arc01 or a gpc devel node with
$ qsub [options] <script> -q arc
where you will replace <script> with the file name of the submission script. Common options are:
-l: specifies requested nodes and time, e.g.
-l nodes=1:ppn=8:gpus=2,walltime=6:00:00
The nodes option is mandatory, and has to contain the ppn=8 part, since scheduling goes by node, and
each node has 8 cores!
Note that the gpus setting is per node, and nodes have 2 gpus.
It is presently probably best to request a full node.
-I specifies that you want an interactive session; a script is not needed in that case.
Once the job is incorporated into the queue, you can see what's queued with
$ showq -w class=arc
and use job-specic commands such as canceljob.
3.5.1 Example job script: OpenMP job on the TCS
#Specifies the name of the shell to use for the job
#@ shell = /usr/bin/ksh
#@ job_name = <some-descriptive-name>
#@ job_type = parallel
#@ class = verylong
#@ environment = copy_all; memory_affinity=mcm; mp_sync_qp=yes; \
# mp_rfifo_size=16777216; mp_shm_attach_thresh=500000; \
# mp_euidevelop=min; mp_use_bulk_xfer=yes; \
# mp_rdma_mtu=4k; mp_bulk_min_msg_size=64k; mp_rc_max_qp=8192; \
# psalloc=early; nodisclaim=true
#@ node = 1
#@ tasks_per_node = 1
#@ node_usage = not_shared
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ wall_clock_limit= 04:00:00
#@ queue
export target_cpu_range=-1
cd /scratch/<username>/<some-directory>
## To allocate as close to the cpu running the task as possible:
export MEMORY_AFFINITY=MCM
## next variable is for OpenMP
export OMP_NUM_THREADS=32
## next variable is for ccsm_launch
export THRDS_PER_TASK=32
## ccsm_launch is a "hybrid program launcher" for MPI/OpenMP programs
poe ccsm_launch ./example
3.5.2 Example job script: MPI job on the TCS
#LoadLeveler submission script for SciNet TCS: MPI job
#@ job_name = <some-descriptive-name>
#@ initialdir = /scratch/<username>/<some-directory>
#@ executable = example
#@ arguments =
#@ tasks_per_node = 64
#@ node = 2
#@ wall_clock_limit= 12:00:00
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ notification = complete
#@ notify_user = <user@example.com>
#Don't change anything below here unless you know exactly
#why you are changing it.
#@ job_type = parallel
#@ class = verylong
#@ node_usage = not_shared
#@ rset = rset_mcm_affinity
#@ mcm_affinity_options = mcm_distribute mcm_mem_req mcm_sni_none
#@ cpus_per_core=2
#@ task_affinity=cpu(1)
#@ environment = COPY_ALL; MEMORY_AFFINITY=MCM; MP_SYNC_QP=YES; \
# MP_RFIFO_SIZE=16777216; MP_SHM_ATTACH_THRESH=500000; \
# MP_EUIDEVELOP=min; MP_USE_BULK_XFER=yes; \
# MP_RDMA_MTU=4K; MP_BULK_MIN_MSG_SIZE=64k; MP_RC_MAX_QP=8192; \
# PSALLOC=early; NODISCLAIM=true
# Submit the job
#@ queue
3.5.3 Example job script: hybrid MPI/OpenMP job on the TCS
To run on 3 nodes, each with 2 MPI processes that have 32 threads, create a file poe.cmdfile containing
ccsm_launch ./example
ccsm_launch ./example
ccsm_launch ./example
ccsm_launch ./example
ccsm_launch ./example
ccsm_launch ./example
and create a script along the following lines
#@ shell = /usr/bin/ksh
#@ job_name = <some-descriptive-name>
#@ job_type = parallel
#@ class = verylong
#@ environment = COPY_ALL; memory_affinity=mcm; mp_sync_qp=yes; \
# mp_rfifo_size=16777216; mp_shm_attach_thresh=500000; \
# mp_euidevelop=min; mp_use_bulk_xfer=yes; \
# mp_rdma_mtu=4k; mp_bulk_min_msg_size=64k; mp_rc_max_qp=8192; \
# psalloc=early; nodisclaim=true
#@ task_geometry = {(0,1)(2,3)(4,5)}
#@ node_usage = not_shared
#@ output = $(job_name).$(jobid).out
#@ error = $(job_name).$(jobid).err
#@ wall_clock_limit= 04:00:00
#@ core_limit = 0
#@ queue
export target_cpu_range=-1
cd /scratch/<username>/<some-directory>
export MEMORY_AFFINITY=MCM
export THRDS_PER_TASK=32:32:32:32:32:32
export OMP_NUM_THREADS=32
poe -cmdfile poe.cmdfile
wait
3.5.4 Example job script: GPU job on the ARC
#!/bin/bash
# Torque submission script for SciNet ARC
#PBS -l nodes=1:ppn=8:gpus=2
#PBS -l walltime=6:00:00
#PBS -N gpu-test
cd $PBS_O_WORKDIR
./example
A Brief Introduction to the Unix Command Line
As SciNet systems run a Unix-like environment, you need to know the basics of the Unix command line. Since
there are many good Unix tutorials on-line, we will only cover some of the most commonly used features here.
Unix prompt
The Unix command line is actually a program called a shell. The shell shows a prompt, something like:
user@scinet01:/home/g/group/user$ _
At the prompt you can type your input, followed by enter. The shell then proceeds to execute your commands.
For brevity, in examples the prompt is abbreviated to $.
There are different Unix shells (on SciNet the default is bash) but their basic commands are the same.
Files and directories
Files are organized in a directory tree. A file is thus specified as /<path>/<file>. The directory separating
character is the slash /. For nested directories, paths are sequences of directories separated by slashes.
There is a root folder / which contains all other folders. Different file systems (hard disks) are mounted
somewhere in this global tree. There are no separate trees for different devices.
In a shell, there is always a current directory. This solves the impracticality of having to specify the full path
for each file or directory. The current directory is by default displayed in the prompt. You can refer to files in
the current directory simply by using their name. You can specify files in another directory by using absolute
(as before) or relative paths. For example, if the current directory is /home/g/group/user, the file a in the
directory /home/g/group/user/z can be referred to as z/a. The special directories . and .. refer to the
current directory and the parent directory, respectively.
Home directory
Each user has a home directory, often called /home/<user>, where <user> is replaced by your user name.
On SciNet, this directory's location is group based (/home/<first letter of group>/<group>/<user>). By
default, files in this directory can be seen only by users in the same group. You cannot write to other users'
home directories, nor can you read home directories of other groups, unless these have changed the default
permissions. The home directory can be referred to using the single character shorthand ~. On SciNet, you
have an additional directory at your disposal, called /scratch/g/group/<user>.
Commands
Commands typed on the command line are either built in to the shell or external, in which case they are a file
somewhere in the file system. Unless you specify the full path of an external command, the shell has to go
look for the corresponding file. The directories in which it looks are stored as a list separated by a colon (:)
in a variable called PATH. In bash, you can append a directory to this as follows:
$ export PATH="$PATH:<newpath>"
Common commands
command   function
ls        list the content of the given or of the current directory.
cat       concatenate the contents of files given as arguments (writes to screen)
cd        change the current directory to the one given as an argument
cp        copy a file to another file
man       show the help page for the command given as an argument
mkdir     create a new directory
more      display the file given as an argument, page-by-page
mv        move a file to another file or directory
pwd       show the current directory
rm        delete a file (no undo or trash!)
rmdir     delete a directory
vi        edit a file (there are alternatives, e.g., nano or emacs)
exit      exit the shell
Hidden les and directories
A file or directory of which the name starts with a period (.) is hidden, i.e., it will not show up in an ls
(unless you give the option -a). Hidden files and directories are typically used for settings.
Variables
Above we already saw an example of a shell variable, namely, PATH. In the bash shell, variables can have
any name and are assigned a value using equals, e.g., MYVAR=15. To use a variable, you type $MYVAR. To
make sure commands can use this variable, you have to export it, i.e., export MYVAR, if it is already set, or
export MYVAR=15 to set it and export it in one command.
Scripts
One can put a sequence of shell commands in a text file and execute it using its name as a command (or
by source <file>). This is useful for automating frequently typed commands, and is called a shell script.
Job scripts as used by the scheduler are a special kind of shell script. Another special script is the hidden file
~/.bashrc which is executed each time a shell is started.
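As a minimal hedged sketch (the file name is arbitrary), put the following two lines in a file called hello.sh:
#!/bin/bash
echo "Hello $USER, you are in $(pwd)"
then make it executable and run it:
$ chmod +x hello.sh
$ ./hello.sh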
GPC Quick Start Guide

Logging In
SciNet allows login only via ssh, a secure protocol. Log in to the login machines at the data centre, and then
into the development nodes, where you do all your work. Batch computing jobs are run on the compute nodes.

For Linux/MacOS users
From a terminal window,
$ ssh -Y [USER]@login.scinet.utoronto.ca
scinet01-$ ssh -Y gpc01        (or gpc02, gpc03, gpc04)
The first command logs you into the login nodes (replace [USER] with your user name); the second logs you into
one of the four development nodes. -Y allows X windows programs to pop up windows on your local machine.

For Windows users
For ssh we suggest:
- the cygwin environment (http://cygwin.com), a linux-like environment. Be sure to install X11 and OpenSSH,
  and you can then (after launching the X11 client) run the commands listed above; or
- MobaXterm (mobaxterm.mobatek.net), a tabbed ssh client.
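If you log in often from Linux/MacOS, one optional convenience (not something SciNet requires) is an entry in
the ~/.ssh/config file on your own machine; the alias scinet below is an arbitrary name:
# ~/.ssh/config on your local machine
Host scinet
    HostName login.scinet.utoronto.ca
    User [USER]
    ForwardX11 yes
    ForwardX11Trusted yes
With this in place, $ ssh scinet behaves like the first ssh -Y command above.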
Machine Details
Each GPC node has 8 processors, ~14GB of free memory, and supports up to 16 threads or processes. Jobs are
allocated entire nodes and must make full use of each.
Modules
Software is accessed by loading modules, which place the package in your environment.
module avail                 List packages.
module load [pkg]            Use default version of [pkg].
module load [pkg]/[v.]       Use version [v.] of [pkg].
module unload [pkg]          Remove [pkg] from path.
module purge                 Remove all packages from path.
Common modules:
module load gcc intel        gcc, intel compilers.
module load openmpi          OpenMPI
module load intelmpi         Intel MPI (recommended)
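A typical sequence on a development node might look as follows; which versions exist is best checked with
module avail on the system itself:
$ module avail                  # list the installed software packages
$ module load intel intelmpi    # load the Intel compilers and the recommended MPI
$ module list                   # show which modules are currently loaded
$ module purge                  # later: return to a clean environment
Frequently used module load lines can also be placed in ~/.bashrc so they take effect at every login.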
Editors
Text-based editors are more responsive over a network connection than graphical editors, but both are available.
vi filename                  vi editor.
gvim filename                vi editor (graphical)
module load emacs
emacs filename               emacs editor (text).
emacs -x filename            emacs editor (graphical)
module load nano
nano filename                Simple nano editor (text).
Disk
/home/[USER]                 10GB, backed up.
/scratch/[USER]              Large capacity; not backed up; purged every 3 months.
module load extras
diskUsage                    Shows user, group disk usage on /home, /scratch.
All SciNet nodes see the same file systems. /home can only be read from on the compute nodes; batch jobs must
be run from /scratch. The shared disk system is optimized for high-bandwidth, large reads and writes. Using
many small files, or doing many small inputs and outputs, is inefficient and slows down the file system for all
users.
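As one way of avoiding many small files, a directory full of small outputs can be bundled into a single archive
before it is stored or transferred (the paths below are placeholders):
$ cd /scratch/g/group/user/run1
$ tar czf smallfiles.tar.gz smallfiles/    # one large write instead of thousands of tiny ones
$ rm -r smallfiles/                        # optionally keep only the archive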
Copying Files
scp copies files via the secure ssh protocol. Small (few GB) files may be copied to or from the login nodes.
E.g., from your local machine, to copy files from SciNet,
scp -C [USER]@login.scinet.utoronto.ca:~/[PathToFile] [LocalPathToNewFile]
copies [PathToFile] to the local directory. To copy a file to SciNet:
scp -C [myfile] [USER]@login.scinet.utoronto.ca:~/[PathToNewFile]
Large files must be sent through the datamover nodes; see the SciNet wiki for details.
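As a concrete illustration with made-up file names, copying a small input file from your local machine to your
SciNet home directory could look like:
$ scp -C input.dat [USER]@login.scinet.utoronto.ca:~/
and copying a result file back to the current local directory:
$ scp -C [USER]@login.scinet.utoronto.ca:~/run1/result.dat .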
Running Jobs
It is ok to run short (few minute), small-memory tests on the development nodes. Others must be run on the
compute nodes via the queues, from the /scratch directory.
Debug queue
A small number of compute nodes are set aside for a debug queue, allowing short jobs (under 2 hours) to run
quickly. To get a single debug node for an hour to run interactively,
qsub -I -l nodes=1:ppn=8,walltime=1:00:00 -q debug
and one can run as if one were logged into the devel nodes. One can also run short debug jobs in batch mode.
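For example, assuming a job script myjob.sh along the lines of the samples below, a short batch run can be sent
to the debug queue with:
$ qsub myjob.sh -q debug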
Batch queue
The usual usage of SciNet is to build and compile your code on /home, then copy the executable and data files
to a directory on /scratch, write a script which describes how to run the job, and submit it to the queue. When
resources are free, your job runs to completion. Jobs in the batch queue may run no longer than 48 hours per
session. Sample scripts follow.
Sample batch script - MPI
#!/bin/bash
#PBS -l nodes=2:ppn=8          Request 2 nodes...
#PBS -l walltime=1:00:00       ...for 1 hour.
#PBS -N test                   Job name
cd $PBS_O_WORKDIR              cd to submission dir.
mpirun -np 16 [prog]           Run program w/ 16 tasks.
Sample batch script - OpenMP
#!/bin/bash
#PBS -l nodes=1:ppn=8          Request 1 node...
#PBS -l walltime=1:00:00       ...for 1 hour.
#PBS -N test                   Job name
cd $PBS_O_WORKDIR              cd to submission dir.
export OMP_NUM_THREADS=8       Run with 8 OpenMP threads
[prog] > job.out               Run, save output in job.out.
Sample batch script - Serial Jobs
It is also possible to run batches of 8 serial jobs on a node, to make sure the node is fully utilized. If all tasks
take roughly the same amount of time:
#!/bin/bash
#PBS -l nodes=1:ppn=8          Request 1 node...
#PBS -l walltime=1:00:00       ...for 1 hour.
#PBS -N serialx8               Job name
cd $PBS_O_WORKDIR              cd to submission dir.
(cd jobdir1; ./dojob1) &       Start task 1
(cd jobdir2; ./dojob2) &
...
(cd jobdir8; ./dojob8) &
wait                           Wait for all to finish
For more complicated cases, see the wiki.
Queue Commands
qsub [script]                  Submit job to batch queue
qsub [script] -q debug         Submit job to debug queue
qstat                          Show your queued jobs
showq --noblock                Show all jobs
checkjob [jobid]               Details of your job [jobid]
showstart [jobid]              Estimate start time
canceljob [jobid]              Cancel your job [jobid]
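Putting these together, a typical submit-and-monitor cycle might look like this (the script name myjob.sh and
the job id 123456 are made up; qsub prints the real id when you submit):
$ qsub myjob.sh          # submit; note the job id that is printed
$ qstat                  # check whether the job is queued or running
$ showstart 123456       # rough estimate of when it will start
$ checkjob 123456        # details while it is queued or running
$ canceljob 123456       # only if you need to withdraw the job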
Ramdisk
Some of a node's memory may be used as a "ramdisk", a very fast file system visible only on-node. If your job
uses little memory but does many small disk inputs/outputs, using ramdisk can significantly speed your job. To
use: copy your inputs to /dev/shm; cd to /dev/shm and run your job; then copy outputs from /dev/shm to
/scratch.
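A sketch of how these steps could look inside a job script (the file names are placeholders):
cd $PBS_O_WORKDIR                 # the /scratch directory the job was submitted from
cp input.dat /dev/shm/            # stage input onto the on-node ramdisk
cd /dev/shm
[prog] > output.dat               # run against the fast local file system
cp output.dat $PBS_O_WORKDIR/     # copy results back to /scratch before the job ends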
Other Resources
http://wiki.scinet.utoronto.ca    Documentation
support@scinet.utoronto.ca        Email us for help