1st Assignment On PDF
DEPARTMENT OF BIOLOGY
PREPARED BY:-
1. MOHAMMED ABDELLA GSR/0196/09
2. DALE ABO GSR/0200/09
3. KAMILA IBRAHIM GSR/0189/09
4. DUGO NURA GSR/0195/09
5. EDAO GELGELU GSR/0199/09
The human mitochondrial genome consists of a single type of circular double-stranded DNA that
is 16.6 kilobases in length. The overall base composition is 44% (G+C), but the two mtDNA strands
have significantly different base compositions: the heavy (H) strand is rich in guanines, but the
light (L) strand is rich in cytosines. Cells typically contain thousands of copies of the
double-stranded mtDNA molecule, but the number can vary considerably in different cell types.
During zygote formation, a sperm cell contributes its nuclear genome, but not its mitochondrial
genome, to the egg cell. Consequently, the mitochondrial genome of the zygote is usually
determined exclusively by that originally found in the unfertilized egg. The mitochondrial genome
is therefore maternally inherited: males and females both inherit their mitochondria from their
mother, but males do not transmit their mitochondria to subsequent generations. During mitotic
cell division, the multiple mtDNA molecules in a dividing cell segregate in a purely random way
to the two daughter cells.
The replication of both the H and L strands is unidirectional and starts at specific origins. Although
the mitochondrial DNA is principally double-stranded, repeat synthesis of a small segment of the
H-strand DNA produces a short third DNA strand called 7S DNA. The 7S DNA strand can base-pair
with the L strand and displace the H strand, resulting in a triple-stranded structure.
This region contains many of the mtDNA control sequences (including the major promoter
regions) and so it is referred to as the CR/D-loop region (where CR denotes control region, and
D-loop stands for displacement loop). The origin of replication for the H strand lies in the CR/D-
loop region, and that of the L strand is sandwiched between two tRNA genes. Only after about
two-thirds of the daughter H strand has been synthesized (by using the L strand as a template
and displacing the old H strand) does the origin for L-strand replication become exposed.
Thereafter, replication of the L strand proceeds in the opposite direction, using the H strand as a
template.
The human mitochondrial genome contains 37 genes, 28 of which are encoded by the H strand
and the other nine by the L strand. Whereas nuclear genes often have their own dedicated
promoters, the transcription of mitochondrial genes resembles that of bacterial genes.
Transcription of mtDNA starts from common promoters in the CR/D-loop region and continues
round the circle (in opposing directions for the two different strands), to generate large
multigenic transcripts. The mature RNAs are subsequently generated by cleavage of the
multigenic transcripts. Almost two-thirds (24 out of 37) of the mitochondrial genes specify a
functional non-coding RNA as their final product. There are 22 tRNA genes, one for each of the
22 types of mitochondrial tRNA. In addition, two rRNA genes are dedicated to making 16S rRNA
and 12S rRNA (components of the large and small subunits, respectively, of mitochondrial
ribosomes). The remaining 13 genes encode polypeptides, which are synthesized on
mitochondrial ribosomes. These 13 polypeptides form part of the mitochondrial respiratory
complexes, the enzymes of oxidative phosphorylation that are engaged in the production of ATP.
The human mitochondrial genome is extremely compact: all 37 mitochondrial genes lack introns
and are tightly packed (on average, there is one gene per 0.45 kb). The coding sequences of some
genes (notably those encoding the sixth and eighth subunits of the mitochondrial ATP synthase)
show some overlap and, in most other cases, the coding sequences of neighboring genes are
contiguous or separated by one or two non-coding bases. Some genes even lack termination
codons; to overcome this deficiency, UAA codons have to be introduced at the post-
transcriptional level.
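The gene counts and gene density quoted above can be checked with simple arithmetic. The following Python sketch uses only figures given in the text (a 16.6 kb genome, 22 tRNA genes, 2 rRNA genes, and 13 polypeptide-encoding genes); the variable names are illustrative.

```python
# Figures taken from the text above; variable names are illustrative only.
GENOME_KB = 16.6      # size of human mtDNA in kilobases
TRNA_GENES = 22
RRNA_GENES = 2
PROTEIN_GENES = 13

total_genes = TRNA_GENES + RRNA_GENES + PROTEIN_GENES
rna_genes = TRNA_GENES + RRNA_GENES

# Average spacing: one gene per this many kilobases.
kb_per_gene = GENOME_KB / total_genes

# Fraction of genes whose final product is a non-coding RNA.
rna_fraction = rna_genes / total_genes

print(f"{total_genes} genes, one per {kb_per_gene:.2f} kb")
print(f"{rna_genes} of {total_genes} genes ({rna_fraction:.0%}) specify non-coding RNA")
```

The result agrees with the stated density of about one gene per 0.45 kb and with the statement that almost two-thirds of the genes specify non-coding RNA.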
The human nuclear genome is 3.1 Gb (3100 Mb) in size. It is distributed between 24 different
types of linear double-stranded DNA molecule, each of which has histones and nonhistone
proteins bound to it, constituting a chromosome. There are 22 types of autosome and two sex
chromosomes, X and Y. Human chromosomes can easily be differentiated by chromosome
banding, and have been classified into groups largely according to size and, to some extent,
centromere position. There is a single nuclear genome in sperm and egg cells and just two copies
in most somatic cells, in contrast to the hundreds or even thousands of copies of the
mitochondrial genome. Because the size of the nuclear genome is about 186,000 times the size
of an mtDNA molecule, however, the nucleus of a human cell typically contains more than 99% of
the DNA in the cell; the oocyte is a notable exception because it contains as many as 100,000
mtDNA molecules. Not all of the human nuclear genome has been sequenced. The Human
Genome Project focused primarily on sequencing euchromatin, the gene-rich, transcriptionally active
regions of the nuclear genome that account for 2.9 Gb. The other 200 Mb is made up of
permanently condensed and transcriptionally inactive (constitutive) heterochromatin. The
heterochromatin is composed of long arrays of highly repetitive DNA that are very difficult to
sequence accurately. For a similar reason, the long arrays of tandemly repeated transcription
units encoding 28S, 18S, and 5.8S rRNA were also not sequenced. The DNA of human
chromosomes varies considerably in length and also in the proportions of underlying
euchromatin and constitutive heterochromatin. Each chromosome has some constitutive
heterochromatin at the centromere. Certain chromosomes, notably 1, 9, 16, and 19, also have
significant amounts of heterochromatin in the euchromatic region close to the centromere
(pericentromere), and the acrocentric chromosomes each have two sizeable heterochromatic
regions. But the most significant representation is in the Y chromosome, where most of the DNA
is organized as heterochromatin.
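The comparison between nuclear and mitochondrial DNA content made earlier in this section can be sketched in Python. The genome sizes come from the text; the function name, the copy numbers passed to it, and the default of two nuclear genome copies per cell are assumptions chosen for illustration.

```python
NUCLEAR_GENOME_BP = 3.1e9   # one copy of the nuclear genome (from the text)
MTDNA_BP = 16.6e3           # one mtDNA molecule (from the text)

def nuclear_fraction(mtdna_copies: int, nuclear_copies: int = 2) -> float:
    """Fraction of a cell's total DNA (in base pairs) that is nuclear.

    nuclear_copies defaults to 2, as in most somatic cells; this default
    is an assumption of the sketch, not a property of every cell type.
    """
    nuclear = nuclear_copies * NUCLEAR_GENOME_BP
    mito = mtdna_copies * MTDNA_BP
    return nuclear / (nuclear + mito)

size_ratio = NUCLEAR_GENOME_BP / MTDNA_BP
print(f"nuclear genome / mtDNA molecule: about {size_ratio:,.0f}")

# Illustrative copy numbers: a typical somatic cell and an oocyte-like cell.
for copies in (1_000, 100_000):
    print(f"{copies:>7} mtDNA copies: {nuclear_fraction(copies):.1%} nuclear")
```

With about a thousand mtDNA copies the nuclear share stays above 99%, whereas at 100,000 copies it falls well below that, which is why the oocyte is singled out as an exception.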
The base composition of the euchromatic component of the human genome averages out at 41%
(G+C), but there is considerable variation between chromosomes, from 38% (G+C) for
chromosomes 4 and 13 up to 49% (for chromosome 19). It also varies considerably along the
lengths of chromosomes. For example, the average (G+C) content on chromosome 17q is 50%
for the distal 10.3 Mb but drops to 38% for the adjacent 3.9 Mb. There are regions of less than
300 kb with even wider swings, for example from 33.1% to 59.3% (G+C).
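Regional variation in base composition of this kind is usually measured by computing the (G+C) fraction in successive windows along a sequence. The following is a minimal Python sketch of that idea; the function names and the toy sequence are made up for illustration and are not taken from any genome.

```python
def gc_content(seq: str) -> float:
    """Return the (G+C) fraction of a DNA sequence, ignoring case."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)

def windowed_gc(seq: str, window: int, step: int) -> list[float]:
    """(G+C) fraction in successive fixed-size windows along seq."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]

# Toy sequence: a GC-rich stretch followed by an AT-rich stretch.
toy = "GCGCGGCCGC" + "ATATTAAATA"
print(gc_content(toy))                       # overall composition
print(windowed_gc(toy, window=10, step=10))  # composition per window
```

The overall value hides the contrast between the two halves, which only the windowed calculation reveals. Real analyses use windows of kilobases to megabases rather than ten bases.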
Human genes are unevenly distributed on the nuclear DNA molecules. The constitutive
heterochromatin regions are devoid of genes and, even within the euchromatic portion of the
genome, gene density can vary substantially between chromosomal regions and also between
whole chromosomes.
For many years, molecular geneticists believed that the major functional endpoint of DNA was
protein. Studies of prokaryotic genomes supported this belief, partly because these genomes are
rich in protein-coding DNA. It came as a surprise to find that the much larger genomes of complex
eukaryotes have comparatively little protein-coding DNA. For example, protein-coding DNA
sequences account for close to 90% of the E. coli genome but just 1.1%-2% of the human genome.
The complete protein-coding capacity of the genome is contained within the exome, which
consists of the DNA sequences in exons that can be translated into proteins.
Figure 4. Example of exons and introns in a DNA sequence
Human protein-coding genes show enormous variation in size and internal organization
Size variation
Genes in simple organisms such as bacteria are comparatively similar in size and are usually very
short (typically about 1 kb long). In complex eukaryotes, genes can show huge variation in size.
Although there is generally a direct correlation between gene and product sizes, there are some
striking anomalies. For example, the giant 2.4 Mb dystrophin gene is more than 50 times the size
of the apolipoprotein B gene but the dystrophin protein has a linear length (total amino acid
number) that is about 80% of that of apolipoprotein B. A small minority of human protein-coding
genes lack introns and are generally small.
For those that do possess introns, there is an inverse correlation between gene size and fraction
of coding DNA. This does not arise because exons in large genes are smaller than those in small
genes. The average exon size in human genes is close to 300 bp, and exon size is comparatively
independent of gene length. Instead, there is huge variation in intron lengths, and large genes
tend to have very large introns. Transcription of long introns is, however, costly in time and
energy; transcription of the 2.4 Mb dystrophin gene takes about 16 hours. Thus, very highly
expressed genes often have short introns or no introns at all.
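The cost of transcribing long genes can be made concrete with the dystrophin figures quoted above. This Python sketch derives the implied average elongation rate and applies it to a hypothetical 10 kb gene; it assumes a constant rate along the gene, which is a simplification, and the 10 kb example is not taken from the text.

```python
GENE_LENGTH_BP = 2.4e6      # dystrophin gene length (from the text)
TRANSCRIPTION_HOURS = 16    # time to transcribe it (from the text)

# Implied average elongation rate in nucleotides per second.
rate_nt_per_s = GENE_LENGTH_BP / (TRANSCRIPTION_HOURS * 3600)
print(f"implied rate: about {rate_nt_per_s:.0f} nucleotides per second")

def transcription_minutes(gene_length_bp: float,
                          rate: float = rate_nt_per_s) -> float:
    """Minutes needed to transcribe a gene at a constant elongation rate."""
    return gene_length_bp / rate / 60

# Hypothetical short gene for comparison (length chosen for illustration).
print(f"10 kb gene: {transcription_minutes(10_000):.1f} minutes")
```

At the same rate, a gene of a few kilobases is transcribed in minutes rather than hours, which is consistent with highly expressed genes tending to have short introns.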
Different proteins can be specified by overlapping transcription units
Overlapping genes and genes-within-genes
In complex organisms, such as humans, genes are much bigger, and there is less clustering of
protein-coding sequences. Gene density varies enormously from chromosome to chromosome
and within different regions of the same chromosome. In chromosomal regions with high gene
density, overlapping genes may be found; they are typically transcribed from opposing DNA
strands. Whole genome analyses show that about 9% of human protein-coding genes overlap
another such gene. More than 90% of the overlaps involve genes transcribed from opposing
strands. Recent analyses have also shown that RNA genes can frequently overlap protein-coding
genes.
2. NON-CODING DNA (ncDNA)
Non-coding DNA is defined as all of the DNA sequences within a genome that are not found within
protein-coding exons, and so are never represented within the amino acid sequence of expressed
proteins. By this definition, more than 98% of the human genome is composed of ncDNA.
Numerous classes of non-coding DNA have been identified, including genes for non-coding RNA
(e.g. tRNA and rRNA), pseudogenes, introns, untranslated regions of mRNA, regulatory DNA
sequences, repetitive DNA sequences, and sequences related to mobile genetic elements.
Numerous sequences that are included within genes are also defined as non-coding DNA. These
include genes for noncoding RNA (e.g. tRNA, rRNA), and untranslated components of protein-
coding genes (e.g. introns, and 5' and 3' untranslated regions of mRNA).
Highly repetitive DNA sequences are often found within introns and flanking sequences of genes.
In addition, repetitive DNA sequences are found to different extents in exons. Tandem repetition
of very short oligonucleotide sequences (1–4 bp) is frequent and may simply reflect statistically
expected frequencies for certain base compositions.
In addition to the ncRNA molecules that are encoded by discrete genes, the initial transcripts of
protein coding genes usually contain extensive non-coding sequences, in the form of introns, 5'-
untranslated regions (5'-UTR), and 3'-untranslated regions (3'-UTR). Within most protein-coding
genes of the human genome, the length of intron sequences is 10 to 100 times the length of
exon sequences.
Protein-coding sequences (specifically, coding exons) constitute less than 1.5% of the human
genome. In addition, about 20%-26% of the human genome is introns. Aside from genes (exons
and introns) and known regulatory sequences (8–20%), the human genome contains regions of
non-coding DNA. The exact amount of non-coding DNA that plays a role in cell physiology has
been hotly debated.
Excluding protein-coding sequences, introns, and regulatory regions, much of the non-coding
DNA is composed of repetitive sequences and sequences related to mobile genetic elements.
Many DNA sequences that do not play a role in gene expression have important biological
functions. Comparative genomics studies indicate that about 5% of the genome contains
sequences of non-coding DNA that are highly conserved, sometimes on time-scales representing
hundreds of millions of years, implying that these non-coding regions are under strong
evolutionary pressure and purifying selection.
B. Regulatory DNA sequences
The human genome has many different regulatory sequences which are crucial to controlling
gene expression. Conservative estimates indicate that these sequences make up 5%-8% of the
genome; however, extrapolations suggest that 20-40% of the genome is gene regulatory sequence.
Some types of non-coding DNA, called enhancers, are genetic "switches" that do not encode
proteins but do regulate when and where genes are expressed.
Regulatory sequences have been known since the late 1960s. The first identification of regulatory
sequences in the human genome relied on recombinant DNA technology. More recently, computer
comparisons of genome sequences that identify conserved non-coding sequences have provided an
indication of their importance in functions such as gene regulation.
Figure 5. Regulatory DNA sequences and other related sequences
Pseudogenes
Gene families frequently have defective gene copies in addition to functional genes. A defective
gene copy that contains at least multiple exons of a functional gene is known as a pseudogene.
Other defective gene copies may have only limited parts of the gene sequence, sometimes a
single exon, and so are sometimes described as gene fragments. Clustered gene families often
have defective gene copies that have arisen by tandem duplication. These are examples of non-
processed pseudogenes. Copying can be seen to have been performed at the level of genomic
DNA because non-processed pseudogenes contain counterparts of both exons and introns and
sometimes also of upstream promoter regions. However, even if the copy has sequences that
correspond to the full length of the functional gene, closer examination will identify
inappropriate termination codons in exons, aberrant splice junctions, and so on.
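One of the defects mentioned above, an inappropriate termination codon, is simple to detect computationally. The following Python sketch scans a coding sequence in frame and reports stop codons that occur before the final codon. The function name and both example sequences are invented for illustration, and the stop codons are those of the standard nuclear genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code

def premature_stops(cds: str) -> list[int]:
    """Return 0-based codon indices of in-frame stop codons that occur
    before the final codon of a coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    return [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]

# Invented toy sequences: the second differs by a single base (AAG -> TAG).
functional = "ATGGCCAAGTGGTAA"   # ATG GCC AAG TGG TAA
defective = "ATGGCCTAGTGGTAA"    # ATG GCC TAG TGG TAA

print(premature_stops(functional))
print(premature_stops(defective))
```

A real pseudogene analysis would first align the copy to the functional gene to establish the reading frame and exon boundaries; this sketch assumes the frame is already known.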
A few gene families that are distributed at different chromosomal locations can also have non-
processed pseudogene copies of a single functional gene. Certain types of sub-chromosomal
region, notably pericentromeric and subtelomeric regions, are comparatively unstable. They are
prone to recombination events that can result in duplicated gene segments (containing both
exons and introns) being distributed to other chromosomal locations. The gene copies are
typically defective because they lack some of the functional gene sequence. Two illustrative
examples are sequences related to the NF1 (neurofibromatosis type I) and the PKD1 (adult
polycystic kidney disease) genes.
Processed pseudogenes are defective copies of a gene that contain only exonic sequences and
lack intronic sequences and upstream promoter sequences. They arise by retrotransposition:
cellular reverse transcriptases can use processed gene transcripts such as mRNA to make cDNA
that can then integrate into chromosomal DNA. Processed pseudogenes are common in
interspersed gene families. Processed pseudogenes lack a promoter sequence and so are
typically not expressed. Sometimes, however, the cDNA copy integrates into a chromosomal DNA
site that happens, by chance, to be adjacent to a promoter that can drive expression of the
processed gene copy. Selection pressure may ensure that the processed gene copy continues to
make a functional gene product, in which case it is described as a retrogene. A variety of
intronless retrogenes are known to have testis-specific expression patterns and are typically
autosomal homologs of an intron-containing X-linked gene. One rationale for retrogenes may be
a critical requirement to overcome the lack of expression of certain X-linked sequences in the
testis during male meiosis. During male meiosis, the paired X and Y chromosomes are converted
to heterochromatin, forming the highly condensed and transcriptionally inactive XY body.
Autosomal retrogenes can provide the continued synthesis in testis cells of certain crucially
important products that are no longer synthesized by genes in the highly condensed XY body.
Figure 6. A pseudogene and its transcriptional process
Non-coding RNA molecules play many essential roles in cells, especially in the many reactions of
protein synthesis and RNA processing. Non-coding RNAs include tRNA, ribosomal RNA, microRNA,
snRNA, and other non-coding RNAs, including about 60,000 long non-coding RNAs (lncRNAs).
It should be noted that while the number of reported lncRNA genes continues to rise and the
exact number in the human genome is yet to be defined, many of them are argued to be non-
functional.
Many ncRNAs are critical elements in gene regulation and expression. Non-coding RNA also
contributes to epigenetics, transcription, RNA splicing, and the translational machinery. The role
of RNA in genetic regulation and disease offers a new potential level of unexplored genomic
complexity.
Genes contain some repetitive DNA sequences, including repetitive coding DNA. However, the
majority of highly repetitive DNA sequences occur outside genes. Some of the sequences are
present at certain sub-chromosomal regions as large arrays of tandem repeats. This type of DNA,
known as heterochromatin, remains highly condensed throughout the cell cycle and does not
generally contain genes. Other highly repetitive DNA sequences are interspersed throughout the
human genome and were derived by duplicative transposition. Sequences like this are sometimes
described as transposon repeats and they account for more than 40% of the total DNA sequence
in the human genome. In addition to residing in extragenic regions, they are often found in
introns and untranslated sequences and sometimes even in coding sequences.
Constitutive heterochromatin is largely defined by long arrays of high-copy-number tandem
DNA repeats
The DNA of constitutive heterochromatin accounts for 200 Mb or 6.5% of the human genome. It
encompasses megabase regions at the centromeres and comparatively short lengths of DNA at
the telomeres of all chromosomes. Most of the Y chromosome and most of the short arms of the
acrocentric chromosomes (13, 14, 15, 21, and 22) consist of heterochromatin. In addition, there
are very substantial heterochromatic regions close to the centromeres of certain chromosomes,
notably chromosomes 1, 9, 16, and 19. The DNA of constitutive heterochromatin mostly
consists of long arrays of high-copy-number tandemly repeated DNA sequences, known as
satellite DNA. Shorter arrays of tandem repeats are known as minisatellites and, shorter still,
microsatellites. Large tracts of heterochromatin are typically composed of a mosaic of different
satellite DNA sequences that are occasionally interrupted by transposon repeats but are devoid
of genes.
There are different satellite DNA organizations, and the repeated unit may be a very simple
sequence (less than 10 nucleotides long) or a moderately complex one that can extend to over
100 nucleotides long. At the sequence level, satellite DNA is often extremely poorly conserved
between species. Its precise function remains unclear, although some human satellite DNAs are
implicated in the function of centromeres whose DNA consists very largely of various families of
satellite DNA.
The centromere is an epigenetically defined domain. Its function is independent of the underlying
DNA sequence; instead, its function depends on its particular chromatin organization, which,
once established, has to be stably maintained through multiple cell divisions. Of the various
satellite DNA families associated with human centromeres, only the α-satellite is known to be
present at all human centromeres, and its repeat units often contain a binding site for a specific
centromere protein. Cloned α-satellite arrays have been shown to seed de novo centromeres in
human cells, indicating that α-satellite must have an important role in centromere function. The
specialized telomeric DNA consists of medium-sized arrays just a few kilobases long and
constitutes a form of minisatellite DNA. Unlike satellite DNA, telomeric minisatellite DNA has
been extraordinarily conserved during vertebrate evolution and has an integral role in telomere
function. It consists of arrays of tandem repeats of the hexanucleotide TTAGGG that are
synthesized by the telomerase ribonucleoprotein.
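Tandem arrays of a known repeat unit such as TTAGGG can be located with a regular expression. The Python sketch below reports the number of repeat units in the longest uninterrupted array in a sequence; the function name and the toy sequence are assumptions made for illustration.

```python
import re

TELOMERE_UNIT = "TTAGGG"  # hexanucleotide repeat named in the text

def longest_tandem_run(seq: str, unit: str = TELOMERE_UNIT) -> int:
    """Number of copies of `unit` in the longest uninterrupted tandem
    array found in seq (0 if the unit does not occur)."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit)
            for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)

# Toy sequence: an array of five repeats, a spacer, then two more repeats.
toy = "ACGT" + "TTAGGG" * 5 + "CCATG" + "TTAGGG" * 2
print(longest_tandem_run(toy))
```

This only finds perfect repeats. Real telomeric and satellite arrays contain variant units, so practical tools allow mismatches.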
Figure 7. Position of heterochromatin at the centromeric region.
Transposon-derived repeats make up more than 40% of the human genome and arose mostly through
RNA intermediates
Almost all of the interspersed repetitive non-coding DNA in the human genome
is derived from transposons (also called transposable elements), mobile DNA sequences that can
migrate to different regions of the genome. Close to 45% of the genome can be recognized as
belonging to this class, but much of the remaining unique DNA must also be derived from ancient
transposon copies that have diverged extensively over long evolutionary time-scales. In humans
and other mammals there are four major classes of transposon repeat, but only a tiny minority
of transposon repeats are actively transposing.
• Retrotransposons. Members of the first three classes (LINEs, SINEs, and retrovirus-like LTR
transposons) are copied through an RNA intermediate that is reverse-transcribed into DNA.
• DNA transposons. Members of this fourth class of transposon migrate directly without any
copying of the sequence; the sequence is excised and then reinserted elsewhere in the genome
(a cut-and-paste mechanism).
Transposable elements that can transpose independently are described as autonomous; those
that cannot are known as nonautonomous. Of the four classes of transposable element, LINEs
and SINEs predominate; we describe them more fully below.
LTR transposons include autonomous and nonautonomous retrovirus-like elements that are
flanked by long terminal repeats (LTRs) containing necessary transcriptional regulatory elements.
Endogenous retroviral sequences contain gag and pol genes, which encode a protease, reverse
transcriptase, RNase, and integrase. They are thus able to transpose independently. There are
three major classes of human endogenous retroviral sequence (HERV), with a cumulative copy
number of about 240,000, accounting for a total of about 4.6% of the human genome. Very many
HERVs are defective, and transposition has been extremely rare during the last several million
years.
Nonautonomous retrovirus-like elements lack the pol gene and often also the gag gene (the
internal sequence having been lost by homologous recombination between the flanking LTRs).
The MaLR family of such elements accounts for almost 4% of the genome.
DNA transposons have terminal inverted repeats and encode a transposase that regulates
transposition. They account for close to 3% of the human genome and can be grouped into
different classes that can be subdivided into many families with independent origins. These
include MER1 and MER2, plus a variety of less frequent families. Virtually all the resident human DNA
transposon sequences are no longer active; they are therefore transposon fossils. DNA
transposons tend to have short lifespans within a species, unlike some of the other transposable
elements such as LINEs. However, quite a few functional human genes seem to have originated
from DNA transposons, notably genes encoding the RAG1 and RAG2 recombinases and the major
centromere-binding protein.
A few human LINE-1 elements are active transposons and enable the transposition of other
types of DNA sequence
LINEs (long interspersed nuclear elements) have been very successful transposons. They have a
comparatively long evolutionary history, occurring in other mammals, including mice. As
autonomous transposons, they can make all the products needed for retrotransposition,
including the essential reverse transcriptase.
Human LINEs consist of three distantly related families: LINE-1, LINE-2, and LINE-3, collectively
comprising about 20% of the genome. They are found primarily in euchromatic regions,
preferentially in the dark AT-rich G bands (Giemsa-positive) of metaphase chromosomes.
Of the three human LINE families, LINE-1 (or L1) is the only family that continues to have actively
transposing members. LINE-1 is the most important human transposable element and accounts
for a higher fraction of genomic DNA (17%) than any other class of sequence in the genome. Full-
length LINE-1 elements are more than 6 kb long and encode two proteins: an RNA-binding protein
and a protein with both endonuclease and reverse transcriptase activities.
The LINE-1 machinery is responsible for most of the reverse transcription in the genome, allowing
retrotransposition of the non-autonomous SINEs and also of copies of mRNA, giving rise to
processed pseudogenes and retrogenes. Of the 6000 or so full-length LINE-1 sequences, about
80–100 are still capable of transposing, and they occasionally cause disease by disrupting gene
function after insertion into an important conserved sequence.
Alu repeats are the most numerous human DNA elements and originated as copies of 7SL RNA
SINEs (short interspersed nuclear elements) are retrotransposons about 100–400 bp in length.
They have been very successful in colonizing mammalian genomes, resulting in various
interspersed DNA families, some with extremely high copy numbers. Unlike LINEs, SINEs do not
encode any proteins and they cannot transpose independently. However, SINEs and LINEs share
sequences at their 3' ends, and SINEs have been shown to be mobilized by neighboring LINEs. By
parasitizing the LINE transposition machinery, SINEs can attain very high copy
numbers.
The human Alu family is the most prominent SINE family in terms of copy number, and is the
most abundant sequence in the human genome, occurring on average more than once every 3
kb. The full-length Alu repeat is about 280 bp long and consists of two tandem repeats, each
about 120 bp in length followed by a short An/Tn sequence. The tandem repeats are asymmetric:
one contains an internal 32 bp sequence that is lacking in the other.
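The abundance figures above imply a rough copy number for Alu repeats. The Python sketch below derives it from the numbers in the text (one Alu at least every 3 kb, a 280 bp full-length repeat, and a 3.1 Gb genome). It assumes every copy is full length, which real copies often are not, so the genome share should be read as an order-of-magnitude estimate only.

```python
GENOME_BP = 3.1e9        # nuclear genome size (from the text)
ALU_SPACING_BP = 3_000   # "more than once every 3 kb" (from the text)
ALU_LENGTH_BP = 280      # full-length Alu repeat (from the text)

# Lower bound on copy number: at least one copy per 3 kb of genome.
min_copies = GENOME_BP / ALU_SPACING_BP

# Rough genome share, assuming (for illustration) full-length copies.
alu_fraction = min_copies * ALU_LENGTH_BP / GENOME_BP

print(f"at least {min_copies:,.0f} Alu copies")
print(f"roughly {alu_fraction:.1%} of the genome")
```

On these assumptions there are more than a million Alu copies, together making up several percent of the genome.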
Alu repeats have a relatively high GC content and, although dispersed mainly throughout the
euchromatic regions of the genome, are preferentially located in the GC-rich and gene-rich R
chromosome bands, in striking contrast to the preferential location of LINEs in AT-rich DNA.
However, when located within genes they are, like LINE-1 elements, confined to introns and the
untranslated regions. Despite the tendency to be located in GC-rich DNA, newly transposing Alu
repeats show a preference for AT-rich DNA, but progressively older Alu repeats show a
progressively stronger bias toward GC-rich DNA.
This pattern suggests that Alu repeats are not just genome parasites but are making a useful contribution
to cells containing them. Some Alu sequences are known to be actively transcribed and may have
been recruited to a useful function. The BCYRN1 gene, which encodes the BC200 neural
cytoplasmic RNA, arose from an Alu monomer and is one of the few Alu sequences that are
transcriptionally active under normal circumstances. In addition, the Alu repeat has recently been
shown to act as a trans-acting transcriptional repressor during the cellular heat shock response.