Zagazig University
Faculty of Science
Mathematics Department
USING SEQUENCE ALIGNMENT ALGORITHMS IN SOME
APPLICATIONS IN BIOINFORMATICS
A Thesis
Submitted to Department of Mathematics, Faculty of Science, Zagazig
University, Egypt.
In Partial Fulfillment of Requirements of Award of M.Sc. Degree
(Mathematics and computer science)
By
Gaber Hassan Al-sayed Ahmed Abdelaal
B.Sc. (Mathematics and Computer Science, 2007), Zagazig University, Egypt
Teaching Assistant at Basic Science Department, Faculty of Engineering, Sinai
University, Egypt.
2014
In the name of Allah, the Most Gracious, the Most Merciful
"My success comes only through Allah. In Him I trust, and to Him I turn."
Allah the Almighty has spoken the truth.
(Surah Hud, verse 88)
In the name of Allah, the Most Gracious, the Most Merciful
"And you have been given only a little knowledge."
Allah the Almighty has spoken the truth.
(Surah Al-Isra, verse 85)
ACKNOWLEDGMENT
First of all, all praise is due to ALLAH AL-WAHAB, Who guided and aided me in
bringing forth this thesis; without His help it would not have been possible.
I would like to take this opportunity to thank all those who helped me and
contributed in many different ways.
To begin with, it is fitting to acknowledge Dr Ahmed Mansour Alzohairy,
Dr Osama Abdo Mohamed, and Dr Mohamed Hussein Saleh. It has been a pleasure to
work with them, and their expertise and guidance proved invaluable while I
worked on this thesis. Without their assistance, valuable discussions, and
critical reading of the manuscript, this work could not have been accomplished.
They taught me a great deal in my field of study and showed me how the problems
could be solved.
Furthermore, I would like to express my thanks to all members of the Department
of Mathematics, Faculty of Science, Zagazig University, for their constant help.
Finally, I wish to express my deep thanks and gratitude to my parents and my
sisters for their encouragement during the preparation of this thesis.
Gaber H. Alsayed
(2014)
PUBLICATIONS
1. A Comprehensive Study by Using Different Alignment Algorithms to Demonstrate
the Genetic Evolution of Heat Shock Factor 1 (HSF1) in Different Eukaryotic
Organisms, published in IRACST – Engineering Science and Technology: An
International Journal (ESTIJ), 3(2), (April 2013): 376.
2. Mathematical Modeling and Classification of Viruses from Herpesvirus Family,
published in International Journal of Computer Applications (IJCA), Volume 87,
No. 12, February 2014.
ABSTRACT
Bioinformatics is a multidisciplinary science focusing on the application of
computational methods and mathematical statistics to molecular biology.
Bioinformatics is also called:
1) Computational Biology (USA)
2) Computational Molecular Biology
3) Computational Genomics
Studying bioinformatics is important because it offers an opportunity to use
some of the most interesting computational techniques to understand some of
the deep mysteries of life and disease and, hopefully, to contribute to curing
some of the diseases that affect people. Bioinformatics combines algorithms
from computer science and statistics to analyze, understand, and form
hypotheses about the large repositories of collected biological data and
knowledge. One of the most important topics in bioinformatics is sequence
alignment, or sequence comparison, which is the subject of this thesis.
In the comparison of biomolecular sequences (i.e., those of DNA, RNA, and
protein), regions of high sequence similarity often indicate significant
functional or structural similarity as well. The same and related molecular
structures and mechanisms are reused and modified during evolution, and thus
show up repeatedly within a single genome or across the genomes of a wide
variety of species. As a result, sequence comparison is the most commonly used
method for inferring structure and biological function. Of course, sequences
can have similar structure and function without exhibiting sequence similarity.
Sequence comparison is also the first step for many problems in computational
biology, such as evolutionary tree reconstruction, genome analysis, and
classification of viruses that belong to a specific family of viruses.
One of the most important ways to understand the life cycle of viruses is to
construct a mathematical model that describes the six steps of the lytic cycle
of a virus.
CONTENTS
Acknowledgment
Publications
Summary
Contents
List of abbreviations
List of figures
List of tables
CHAPTER ONE
FUNDAMENTAL CONCEPTS
Biological Background
Introduction
Bioinformatics cycle
Nucleic Acids
DNA
RNA
Protein
Sequence Analysis
Sequence Alignment
Pairwise Sequence Alignment
Multiple Sequence Alignment
Basic Definitions on String
Motivations
Objectives
CHAPTER TWO
SEQUENCE ALIGNMENT ALGORITHMS
Pairwise Sequence Alignment by Dynamic Programming
Global Sequence Alignment
Global sequence alignment algorithm (Needleman-Wunsch algorithm)
Local sequence alignment
Local sequence alignment algorithm (Smith-Waterman algorithm)
Gene Tracer algorithm
Multiple Sequence Alignment algorithms
CHAPTER THREE
USING SEQUENCE ALIGNMENT ALGORITHMS TO
DEMONSTRATE THE GENETIC EVOLUTION OF HEAT SHOCK
FACTOR 1 (HSF1) IN DIFFERENT EUKARYOTIC ORGANISMS
Overview
Introduction
Heat shock factor protein
Biological data used in our work
Results and discussion
Pairwise global alignment among the common conserved domains (HSF_DNA-bind)
and also among the entire sequence of HSF1 protein
Multiple sequence alignment (MSA) among all the conserved domains
(HSF_DNA-bind) in HSF1 protein sequences and also among all the entire
sequences of HSF1 protein
Using Gene Tracer algorithm to specify some results obtained in Table 3.1 and
Table 3.2
Conclusion
CHAPTER FOUR
MATHEMATICAL MODELING AND CLASSIFICATION OF
VIRUSES FROM HERPESVIRUS FAMILY
Overview
Introduction to viruses
Enveloped DNA viruses
Herpesviridae (herpesvirus) family
Biological data used in our work
Mathematical modeling of herpesvirus family life cycle
Results and discussion
Pairwise global alignment among the entire sequences of herpesvirus family
Multiple sequence alignment (MSA) among the entire sequences of the capsid
protein of herpesvirus family
Conclusion
CHAPTER FIVE
CONCLUSION AND OUTLOOK
Conclusion
Outlook
References
LIST OF ABBREVIATIONS
DNA            Deoxyribonucleic Acid
RNA            Ribonucleic Acid
rRNA           Ribosomal Ribonucleic Acid
mRNA           Messenger Ribonucleic Acid
tRNA           Transfer Ribonucleic Acid
A              Adenine
G              Guanine
C              Cytosine
T              Thymine
U              Uracil
DP             Dynamic Programming
NW             Needleman-Wunsch
SW             Smith-Waterman
MSA            Multiple Sequence Alignment
BLOSUM         Blocks Substitution Matrix
PAM            Percent Accepted Mutation
HS             Heat Shock
HSF            Heat Shock Factor
NCBI           National Center for Biotechnology Information
UniProtKB      UniProt Knowledgebase
HSV1           Herpes simplex virus 1
HSV2           Herpes simplex virus 2
EBV            Epstein-Barr virus
HHV6           Human herpesvirus 6
HHV7           Human herpesvirus 7
HHV8           Human herpesvirus 8
VZV (HSV3)     Varicella-zoster virus
HCMV (HHV5)    Human cytomegalovirus
LIST OF FIGURES
Fig. 1.1    The use and development of mathematical algorithms and computer programs to obtain insight into biological and medical systems
Fig. 1.2    Bioinformatics cycle
Fig. 1.3    DNA Structure
Fig. 1.4    RNA Structure
Fig. 1.5    Different Graphical Representations of Proteins
Fig. 1.6    From DNA to Protein (central dogma of molecular biology)
Fig. 1.7    Dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac
Fig. 1.8    Filtered dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac
Fig. 2.1    Matrix sim for scoring alignment of s = AAAC and t = AGC
Fig. 2.2    Local Alignment Example
Fig. 2.3    Gene Tracer Function
Fig. 3.1    Structure of Heat Shock Factor Protein 1 DNA Binding Domain from Homo sapiens
Fig. 3.2    Results of sequence dot plot matrix among the conserved domains (HSF_DNA-bind) in HSF1 protein sequences
Fig. 3.3    Results of sequence dot plot matrix among the entire sequences of HSF1 protein
Fig. 3.4    Graphical summary of the conserved domains of HSF1 protein in the different organisms human, Danio rerio, Taurus, mouse, yeast, and plant
Fig. 3.5    Similarity between the conserved domains in HSF1 protein sequences in all the above species in the case of MSA, using the scoring matrix BLOSUM60
Fig. 3.6    Similarity between the conserved domains in HSF1 protein sequences in all the above species in the case of MSA, using the scoring matrix BLOSUM80
Fig. 3.7    Similarity between the conserved domains in HSF1 protein sequences in all the above species in the case of MSA, using the scoring matrix PAM10
Fig. 3.8    Similarity between all sequences of the above species in the case of the HSF1 protein sequence, using the scoring matrix BLOSUM60
Fig. 3.9    Similarity between all sequences of the above species in the case of the HSF1 protein sequence, using the scoring matrix BLOSUM80
Fig. 3.10   Similarity between all sequences of the above species in the case of the HSF1 protein sequence, using the scoring matrix PAM1
Fig. 3.11   Results of Gene Tracer program in the case of the conserved domain sequence (HSF_DNA-bind)
Fig. 3.12   Results of Gene Tracer program in the case of the entire sequence of HSF1 protein
Fig. 4.1    Virus structure components
Fig. 4.2    General DNA virus structure
Fig. 4.3    Simple diagram representing the herpesvirus lytic cycle
Fig. 4.4    The solution to the system of equations produced by ODE45
Fig. 4.5    Results of sequence dot plot matrix among the entire sequences of their capsid protein
Fig. 4.6    Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix BLOSUM60
Fig. 4.7    Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix BLOSUM30
Fig. 4.8    Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix PAM10
Fig. 4.9    Classification of herpesvirus family according to the structure of their capsid protein
LIST OF TABLES
Table 2.1   A simple scoring function
Table 3.1   Results of pairwise global alignment in case of the conserved domains (HSF_DNA-bind)
Table 3.2   Results of pairwise global alignment in case of the entire length of the HSF1 sequences
Table 3.3   Results of gene-tracer algorithm
Table 4.1   The modeling steps symbols
Table 4.2   Results of pairwise global alignment of the entire length of herpesvirus family sequences
CHAPTER ONE
FUNDAMENTAL CONCEPTS
In this chapter we introduce the fundamental concepts needed to study the sequence
alignment problem in bioinformatics. The chapter consists of six sections. In Section 1.1
we review the fundamental facts of molecular biology. In Section 1.2 we give a brief
description of the sequence analysis problem. In Section 1.3 we review the fundamentals
of sequence alignment (pairwise and multiple sequence alignment). In Section 1.4 we
introduce some basic definitions on sequences and strings. Finally, Sections 1.5 and 1.6
present the motivations for studying sequence alignment and the objectives of this
thesis, respectively.
1.1. Biological Background
1.1.1. Introduction
Bioinformatics combines several sciences, such as biology, statistics, computer science,
and mathematics, to process biological data, as shown in Figure 1.1. Databases and
information systems are used to store and organize biological data. Analyzing biological
data may involve algorithms from artificial intelligence, soft computing, data mining,
image processing, and simulation. Biological data may be nucleic acid sequences
(deoxyribonucleic acid "DNA" or ribonucleic acid "RNA") or protein sequences.
Figure: 1.1. The use and development of mathematical algorithms and computer programs
to obtain insight into biological and medical systems.
1.1.2. Bioinformatics cycle
The bioinformatics cycle consists of three stages, as shown in Figure 1.2. In the first
stage, biological data, which represent information, are extracted from biological
experiments in the life sciences. In the second stage, the biological data are stored in
biological databases. In the third stage, several operations may be performed on the
biological sequences retrieved from the database, such as statistical analysis,
visualization, prediction, and modeling, using information science techniques. The
ultimate goal of statistical bioinformatics is to identify statistically significant changes
in biological processes in order to answer biological questions. For example, if a
biologist needs to determine the genetic evolution of a specific gene across different
species, sequence alignment algorithms can be used for this purpose, as shown in
Chapter Three.
Figure: 1.2. Bioinformatics cycle.
1.1.3. Nucleic Acids
DNA and RNA are usually collectively referred to as nucleic acids since in
eukaryotic cells, DNA and RNA occur predominantly in the nuclei. A nucleic acid is
composed of chains of nucleotides. In DNA, there are two chains (strands) forming a
double helix, see Figure 1.3, while in RNA, there is only one strand, see Figure 1.4 [1].
The building unit of both DNA and RNA is the nucleotide. Each nucleotide consists of
three components: a nitrogenous base (nucleobase), a pentose sugar (deoxyribose in
DNA and ribose in RNA), and a phosphate group, as shown in Figure 1.3. Special
attention will be given to the nucleobase component. There are two kinds of nitrogenous
bases: purines and pyrimidines. Two types of purines and three types of pyrimidines are
commonly found in nucleic acids. The two purines are adenine and guanine, abbreviated A
and G. The three pyrimidines are cytosine, thymine, and uracil, abbreviated C, T, and U.
Both DNA and RNA contain A, C, and G; only DNA contains the base T, whereas only
RNA contains the base U.
1.1.3.1. DNA
DNA contains the basic genetic information that each organism needs to live
and reproduce. In DNA, adenine (A) always pairs with thymine (T), and cytosine (C)
with guanine (G), through hydrogen bonds. The complete set of DNA that contains the
entire hereditary information of an organism is called the genome of the organism [2]. The
nucleotides of the DNA sequence can be roughly divided into two groups. The first group
consists of all nucleotides that make up the genes, and the second group is the junk DNA,
nucleotides that have no function or whose function is unknown.
Figure: 1.3. DNA Structure.
1.1.3.2. RNA
The structure of RNA is similar to that of DNA, with two important differences. The
first is that DNA consists of two chains (strands) forming a double helix, whereas RNA
consists of a single chain. The second is that the base thymine (T) of DNA is replaced
by uracil (U) in RNA, so adenine (A) pairs with uracil (U) instead of thymine (T). An
RNA chain therefore consists of the four bases adenine (A), uracil (U), cytosine (C),
and guanine (G).
Figure: 1.4. RNA Structure.
1.1.4. Protein
A protein is a polypeptide chain, a polymer formed from twenty different kinds
of amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). This molecule
exists in a specifically folded structure that is necessary for its function. Thousands of
types of proteins are present in the different cells of an organism, each with a
distinct amino-acid sequence and 3-dimensional structure, see Figure 1.5 [3].
Figure: 1.5. Different Graphical Representations of Proteins.
The process of generating protein from DNA is called gene expression. It occurs in
two major steps. The first is transcription, where the information coded in DNA is copied
into a molecule of RNA whose bases are complementary to those of the DNA. The second
is translation, where the information now encoded in RNA is translated into instructions
for manufacturing a protein by the ribosome. This process is shown in Figure 1.6.
Figure: 1.6. From DNA to Protein (central dogma of molecular biology).
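The base-pairing rules of Section 1.1.3 and the transcription step described above can be illustrated with a short sketch. It is only an illustration under simplifying assumptions: the function names complement_dna and transcribe are hypothetical, strand orientation is ignored, and translation into protein is not modeled.

```python
# Base pairing inside DNA: A pairs with T, and C pairs with G.
DNA_PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Pairing between a DNA template base and the RNA base copied from it
# (uracil U takes the place of thymine T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def complement_dna(strand: str) -> str:
    """Return the complementary DNA strand, base by base (orientation ignored)."""
    return "".join(DNA_PAIRS[base] for base in strand.upper())


def transcribe(template: str) -> str:
    """Return the RNA whose bases are complementary to the DNA template."""
    return "".join(DNA_TO_RNA[base] for base in template.upper())


if __name__ == "__main__":
    print(complement_dna("ATGC"))  # TACG
    print(transcribe("ATGC"))      # UACG
```

The two lookup tables differ in a single entry (A maps to T in DNA but to U in RNA), which mirrors the second difference between DNA and RNA noted in Section 1.1.3.2.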
1.2. Sequence Analysis
One of the most important operations in bioinformatics is sequence analysis:
analyzing DNA or protein sequences to understand their features, functions, and
structures. Sequence alignment, or sequence comparison, is one of the most important
methods for performing sequence analysis.
1.3. Sequence Alignment
We first need to define what we mean by sequence alignment. By writing the
sequences of two genes as strings of characters, with one string above the other, we can
determine at which positions the strings do or do not match. This process is called
alignment or sequence alignment. Sequence alignment involves establishing
correspondences between bases of DNA or RNA strings or between the amino acids forming
linear sequences in proteins. Aligning DNA, RNA, or amino acid sequences is of basic
importance in bioinformatics and can be used for a variety of research purposes. It can
reveal similarity between two DNA sequences that results from a recent common ancestor
from which the two sequences originate. The process of aligning two sequences
is called pairwise alignment. By measuring or computing distances between the aligned
sequences, one draws inferences about the evolutionary processes they have gone through.
This inference about the evolutionary process may involve estimating the time that has
passed from the common ancestor to the present, but may also involve stating hypotheses
about, or reconstructing, a single evolutionary event in the past or a sequence of them.
Aligning two sequences can allow one to detect their overlap, to notice that one sequence
is a part of the other, or to see that the two sequences share a subsequence. Instead of
aligning only two sequences, one can also match a sequence against a DNA, RNA, or
protein database, or align many sequences at once; the latter process is known as
multiple sequence alignment (MSA).
1.3.1. Pairwise Alignment
Pairwise sequence alignment methods are used to find the best-matching piecewise
(local) or global alignments of two query sequences. Pairwise alignments can only be used
between two sequences at a time, but they are efficient to calculate and are often used for
methods that do not require extreme precision (such as searching a database for sequences
with high similarity to a query).
The four primary methods of producing pairwise alignments are:
1. By-hand methods (sliding sequences on two lines of a word processor).
2. The dot-matrix or dot plot method, which implicitly produces a family of
alignments for individual sequence regions. The dot matrix or dot plot is a
simple and very useful concept for aligning two sequences. Assume that the
sequences to be aligned are
s1 = tgagaaaatgctttagcacggctgg
and
s2 = aaatgctttgagcac.
Form a rectangular n × m matrix, where n and m are the lengths of s1 and s2, with rows
corresponding to the characters of the first string s1 and columns corresponding to the
characters of the second string s2, such that the characters of s1 run down the rows and
those of s2 run to the right along the columns. Place a dot in each matrix entry where a
base from s1 matches a base from s2. The result, shown in Figure 1.7, is called a dot matrix.
Figure: 1.7. Dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg
and s2 = aaatgctttgagcac. The dots show possible correspondences between the
characters of the strings s1 and s2.
In a dot plot each diagonal corresponds to a possible (ungapped) alignment. Many of the
dots are accidental matches between letters of the two strings. We can eliminate
some of these by removing dots that are unlikely to represent a nonrandom correspondence
between characters of the strings s1 and s2, using some intuitive criterion. If we
require that a dot is kept only when it belongs to a run of at least k neighboring
matches along the right-down diagonal direction, then some of the random accidental
matches are filtered out. If k is too small, many accidental matches will remain in the
dot matrix plot. On the other hand, if k is too large, some of the true correspondences
between the strings may be unintentionally omitted. If we take k = 3 we obtain the
filtered dot matrix shown in Figure 1.8, which is much easier to interpret than the
original one.
Figure: 1.8. Filtered dot matrix comparison of sequences
s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac. The dots are
now arranged in diagonal paths, which more clearly show the possible
correspondences between the characters of the strings s1 and s2.
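The construction and filtering of the dot matrix can be sketched in a few lines. This is a minimal illustration rather than the program used to produce Figures 1.7 and 1.8; the function names dot_matrix and filter_dots are hypothetical, and "at least k neighboring matches" is interpreted here as a run of at least k consecutive dots along a right-down diagonal.

```python
def dot_matrix(s1: str, s2: str) -> list[list[bool]]:
    """Rows follow s1, columns follow s2; True marks a pair of equal characters."""
    return [[a == b for b in s2] for a in s1]


def filter_dots(dots: list[list[bool]], k: int) -> list[list[bool]]:
    """Keep only dots lying on a right-down diagonal run of at least k dots."""
    n = len(dots)
    m = len(dots[0]) if dots else 0
    kept = [[False] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Handle each diagonal run once, starting from its top-left dot.
            if not dots[i][j] or (i > 0 and j > 0 and dots[i - 1][j - 1]):
                continue
            length = 0
            while (i + length < n and j + length < m
                   and dots[i + length][j + length]):
                length += 1
            if length >= k:
                for d in range(length):
                    kept[i + d][j + d] = True
    return kept


if __name__ == "__main__":
    s1 = "tgagaaaatgctttagcacggctgg"
    s2 = "aaatgctttgagcac"
    filtered = filter_dots(dot_matrix(s1, s2), k=3)
    print("  " + s2)
    for ch, row in zip(s1, filtered):
        print(ch, "".join("*" if dot else "." for dot in row))
```

Running the script prints a text version of the filtered plot, with the characters of s1 labelling the rows and those of s2 labelling the columns; changing k shows the trade-off between leftover accidental matches and lost true correspondences.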
From the filtered dot matrix, we can construct the following alignment between s1 and s2:
s1 = tgagaaaatgcttt-agcacggctgg
          ::::::::: :::::
s2 =      aaatgctttgagcac
Using dot matrices is rather intuitive, since the alignment is performed by
following long lines of dots in the plot. Nevertheless, there is a scoring system behind it.
For example, we may assign a score of 1 for every single match between letters of the
strings, and we should not introduce indels (insertions or deletions, shown as gaps) unless
doing so yields a large enough number of new matches. We should also penalize
correspondences between mismatching symbols.
3. Rigorous mathematical methods (dynamic programming), which are slow but optimal.
This approach uses two important algorithms: the Needleman-Wunsch algorithm (for
global pairwise alignment) and the Smith-Waterman algorithm (for local pairwise
alignment). These two algorithms are discussed in detail in Chapter Two.
4. Heuristic algorithms (faster but approximate):
a. BLAST
b. FASTA
1.3.2. Multiple Sequence Alignment
Multiple sequence alignment is an extension of pairwise alignment to incorporate more
than two sequences at a time. Multiple alignment methods try to align all of the sequences
in a given query set. Multiple alignments are often used in identifying conserved sequence
regions across a group of sequences hypothesized to be evolutionarily related. Alignments
are also used to aid in establishing evolutionary relationships by constructing phylogenetic
trees.
1.4. Basic Definitions on String
As discussed earlier, DNA, RNA, and protein are represented as strings in bioinformatics.
Therefore, in this section, we give some basic definitions that are needed in string
processing [4].
1. Alphabet - a set of allowable symbols. Examples of biosequence alphabets:
Σ = {A, C, G, T} (DNA)
Σ = {A, C, G, U} (RNA)
Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (Proteins)
2. Sequence (or string) - a finite succession of characters chosen from an alphabet,
e.g., ATCCGAACTTG from the DNA alphabet Σ = {A, C, G, T}.
3. Subsequence - a sequence obtained from a sequence by removing zero or more
characters while keeping the remaining characters in their original order,
e.g., TTT is a subsequence of ATATAT
AAAA is not a subsequence of ATATAT
4. Substring - a subsequence of consecutive characters,
e.g., TAC is a substring of ACTACA
TAC is not a substring of ATGAC
5. Length of a string [5] - the number of characters in the string. It can be any
non-negative integer. For example, if Σ = {A, C, G, T}, then ACGATGGGT
is a string over Σ of length 9.
6. Prefix - a substring that starts at the first character of a string; the empty
string is also counted as a prefix,
e.g., the prefixes of the string ACT are the empty string, A, AC, and ACT.
7. Suffix - a substring that ends at the last character of a string; the empty
string is also counted as a suffix,
e.g., the suffixes of the string TAC are the empty string, C, AC, and TAC.
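The definitions above translate directly into short checks. The following sketch is illustrative only; the helper names (is_subsequence, is_substring, prefixes, suffixes) are hypothetical and not taken from the thesis.

```python
def is_subsequence(x: str, s: str) -> bool:
    """True if x can be obtained from s by deleting zero or more characters."""
    remaining = iter(s)
    # Each character of x must be found in s after the previous one was found.
    return all(ch in remaining for ch in x)


def is_substring(x: str, s: str) -> bool:
    """True if x occurs in s as a run of consecutive characters."""
    return x in s


def prefixes(s: str) -> list[str]:
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s: str) -> list[str]:
    """All suffixes of s, from the empty string up to s itself."""
    return [s[i:] for i in range(len(s), -1, -1)]


if __name__ == "__main__":
    print(is_subsequence("TTT", "ATATAT"))   # True
    print(is_subsequence("AAAA", "ATATAT"))  # False
    print(is_substring("TAC", "ACTACA"))     # True
    print(is_substring("TAC", "ATGAC"))      # False
    print(prefixes("ACT"))                   # ['', 'A', 'AC', 'ACT']
    print(suffixes("TAC"))                   # ['', 'C', 'AC', 'TAC']
```

Note that TAC is a subsequence of ATGAC even though it is not a substring of it, which shows that every substring is a subsequence but not the other way round.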
1.5. Motivations
Our motivation for studying sequence comparisons is their importance in molecular
biology. These comparisons are important for a number of reasons [6]:
1. They can be used to establish evolutionary relationships among different
organisms.
2. They may allow prediction of protein function and structure.
3. Comparisons between humans and other species may identify corresponding
genes in model organisms, which can be genetically manipulated to develop
models for human diseases.
1.6. Objectives
The objective of this thesis is to use sequence alignment algorithms in some
applications in molecular biology. The first application is demonstrating the genetic
evolution of a protein called heat shock factor 1 (HSF1), and of the conserved
domains in this protein, in different eukaryotic organisms. The second is classifying
viruses that belong to the herpesvirus family, one of the most important families of
enveloped DNA viruses, as it contains many viruses dangerous to human health.
A mathematical model of the general life cycle of herpesviruses is also introduced.
CHAPTER TWO
SEQUENCE ALIGNMENT ALGORITHMS
In this chapter we explain in detail the algorithms used in our applications. The
chapter consists of three sections. In Section 2.1 we discuss in detail, with examples,
how pairwise sequence alignment (global and local) can be performed by dynamic
programming, using the Needleman-Wunsch algorithm (NW) for global alignment and the
Smith-Waterman algorithm (SW) for local alignment. Section 2.2 discusses an algorithm
based on the Smith-Waterman algorithm, called the Gene Tracer algorithm. Finally, in
Section 2.3 we discuss multiple sequence alignment and the CLUSTALW algorithm.
2.1. Pairwise Sequence Alignment by Dynamic Programming
As mentioned in Section 1.3.1, there is more than one method for performing pairwise
sequence alignment. In this thesis we focus on dynamic programming (DP) methods,
which take longer to execute than heuristic methods but give an optimal alignment for
the chosen scoring function. In mathematics, computer science, and economics, dynamic
programming is a method for solving complex problems by breaking them down into
simpler subproblems. The idea behind dynamic programming is to solve these subproblems
and then combine their solutions to reach the overall solution. Often, many of these
subproblems are identical. The dynamic programming approach seeks to solve each
subproblem only once, thus reducing the number of computations: once the solution to a
given subproblem has been computed, it is stored ("memoized"), and the next time the
same solution is needed it is simply looked up. This approach is especially useful when
the number of repeating subproblems grows exponentially as a function of the size of the
input. There are two important algorithms used in DP pairwise sequence alignment:
1. The Needleman-Wunsch algorithm (for pairwise global sequence alignment).
2. The Smith-Waterman algorithm (for pairwise local sequence alignment).
2.1.1. Global Sequence Alignment
Pairwise global sequence alignment is the process of aligning two sequences over their
entire length. The global alignment problem is to find the best alignment according to
some scoring function that measures similarity.
Given two strings of different lengths, an alignment makes these sequences the same length
through the insertion of gaps. Gaps may be added anywhere within the sequences,
including at the beginning or the end, but a gap cannot be aligned with another gap.
As an example, take two strings, CGACCTA and CGCCTA. One alignment is as follows:
CGACCTA
CG─CCTA
Here we inserted a space (or gap) to generate a good alignment.
The simple scoring function in Table 2.1 can be used to calculate the total score of this
alignment.
Column      Score
Match       +1
Mismatch    -1
Gap         -2
Table 2.1: A simple scoring function
In this simple scoring function, a column containing two identical characters (a match)
receives a score of +1, a column containing two different characters (a mismatch)
receives a score of -1, and a column containing a gap receives a score of -2.
This simple scoring function can be used to form what is known as a substitution matrix.
A substitution matrix is a convenient way of representing many scoring functions. In
general it is used to show the cost of replacing one letter (of either a nucleotide or an
amino acid alphabet) with another letter or a gap. Substitution matrices can be
represented without the gap character, but if we need gaps in the alignment process we
can include a column and a row for them.
According to the simple scoring function in Table 2.1, we can form the following 5 x 5
substitution matrix for this example:
        A     C     G     T     ─
A      +1    -1    -1    -1    -2
C      -1    +1    -1    -1    -2
G      -1    -1    +1    -1    -2
T      -1    -1    -1    +1    -2
─      -2    -2    -2    -2    N/D
Note that the value of aligning a gap with a gap is not defined (N/D), because we do not
need to align gaps with each other in this example.
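A substitution matrix of this kind can be held in a dictionary keyed by pairs of symbols. The sketch below is one possible representation, not the one used in the thesis; the function name build_substitution_matrix and the use of "-" as the gap symbol are assumptions made for illustration.

```python
GAP = "-"  # assumed gap symbol for this sketch


def build_substitution_matrix(alphabet, match=1, mismatch=-1, gap=-2):
    """Build {(a, b): score} from the simple scoring function of Table 2.1."""
    symbols = list(alphabet) + [GAP]
    matrix = {}
    for a in symbols:
        for b in symbols:
            if a == GAP and b == GAP:
                continue  # gap against gap is left undefined (N/D)
            if GAP in (a, b):
                matrix[a, b] = gap
            elif a == b:
                matrix[a, b] = match
            else:
                matrix[a, b] = mismatch
    return matrix


if __name__ == "__main__":
    sub = build_substitution_matrix("ACGT")
    print(sub["A", "A"], sub["A", "C"], sub["G", GAP])  # 1 -1 -2
    print((GAP, GAP) in sub)                            # False
```

Keeping the gap-gap entry absent, rather than storing a placeholder value, means that any attempt to score a gap against a gap fails with a KeyError instead of silently returning a number.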
Using the simple scoring function, the total score of the alignment in the previous
example is
1 + 1 - 2 + 1 + 1 + 1 + 1 = 4
In the following example, however, the total score is 1 + 1 - 2 - 1 + 1 + 1 = 1
C
G A
C
C
T
C
G ─
C
C
T
1
1
-1
1
1
-2
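The column-by-column scoring used in these examples can be sketched as follows. The function name score_alignment and the "-" gap symbol are illustrative assumptions; the default scores are those of Table 2.1.

```python
def score_alignment(row1: str, row2: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, gap_char: str = "-") -> int:
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == gap_char and b == gap_char:
            raise ValueError("a gap cannot be aligned with another gap")
        if gap_char in (a, b):
            total += gap        # column containing a gap
        elif a == b:
            total += match      # two identical characters
        else:
            total += mismatch   # two different characters
    return total


if __name__ == "__main__":
    print(score_alignment("CGACCTA", "CG-CCTA"))  # 4
```

For the alignment of CGACCTA with CG─CCTA the function adds six matches and one gap, giving 6 - 2 = 4, in agreement with the sum worked out by hand.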
The optimal global alignment of two strings is defined as the alignment that maximizes the
total alignment score over all possible alignments [7]. In principle, it can be found
through the following steps:
1. Enumerate all possible alignments.
2. Score each alignment.
3. Choose the alignment with the highest score.
The problem with this approach is that the number of possible alignments is prohibitively
large (exponential in the length of the sequences).
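How quickly the number of alignments grows can be checked with a small counting sketch. It assumes that two alignments are different whenever their columns differ and that a gap is never aligned with a gap; under these assumptions the last column is either a pair of characters, a character of the first string over a gap, or a gap over a character of the second string, which gives the recurrence used below. The function name count_alignments is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(m: int, n: int) -> int:
    """Number of global alignments of two strings of lengths m and n."""
    if m == 0 or n == 0:
        # Only one alignment: every remaining character faces a gap.
        return 1
    return (count_alignments(m - 1, n - 1)   # last column pairs two characters
            + count_alignments(m - 1, n)     # last column: character over gap
            + count_alignments(m, n - 1))    # last column: gap over character


if __name__ == "__main__":
    for length in (1, 2, 3, 5, 10, 20):
        print(length, count_alignments(length, length))
```

Two strings of length 3 already admit 63 alignments under these assumptions, and the count keeps growing exponentially with the length, which is why enumeration is impractical for real sequences.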
Note that because nucleotides differ very little in biochemical function, simple scoring
functions are often used for DNA alignment. Amino acids, on the other hand, can be quite
different from one another, and mismatches can have varied effects depending on how
similar or dissimilar the amino acids are in their biochemical properties. Scores based on
inferences about chemical or physical properties of proteins are possible and useful. It is
well known that certain pairs of amino acids are much more likely to substitute for each
other during evolution than others. This is likely due to physicochemical properties
that they have in common, such as hydrophobicity, size, or electrical charge. A good
alignment should take this into account and incorporate it into the scoring function so
that the overall alignment reflects the biological similarity between sequences more
closely. The two most common scoring functions that do this are based on observed
substitution frequencies in proteins, and are called PAM (Percent Accepted Mutations)
matrices [8] and BLOSUM (Blocks Substitution Matrix) matrices [9].
Fortunately, there is an algorithm based on dynamic programming, the Needleman-Wunsch
algorithm, that computes the best alignment in O(mn) time, where m and n are the
lengths of the two sequences.
We first demonstrate the Needleman-Wunsch algorithm through an example.
Example:
Suppose we want to know the score of the best alignment of s = AAAC and t = AGC using
our simple scoring function.
Notation: let s(i) and t(j) denote the ith and jth characters of s and t respectively.
Considering just the last column of the alignment, we have only three possibilities:
1. The last character of s (C) is aligned with the last character of t (C). In this case
the score of the best alignment of s and t is equal to the score of the best
alignment of AAA (the remaining portion of s) and AG (the remaining portion
of t), plus 1 for matching the last characters.
2. The last character of s (C) is aligned with a gap. In this case the score of the best
alignment of s and t is equal to the score of the best alignment of AAA (the
remaining portion of s) and AGC (all of t), minus 2 for inserting a gap.
3. The last character of t (C) is aligned with a gap. In this case the score of the best
alignment of s and t is equal to the score of the best alignment of AAAC (all of
s) and AG (the remaining portion of t), minus 2 for inserting a gap.
If we know the answers to the three subproblems mentioned above, then we know the
score of the best alignment between s and t. Note that the subproblems consist of aligning
prefixes of s and t. We will find and save optimal solutions for all prefixes of s and t,
building up from shorter ones to longer ones. There are five prefixes of s: empty, A, AA,
AAA, and AAAC, and we will refer to these as the 0th, 1st, 2nd, 3rd, and 4th
prefixes of s. Likewise there are four prefixes of t: empty, A, AG, and AGC. The algorithm
uses a matrix representation (in this case a 5x4 matrix), with the characters of s along the rows
and the characters of t along the columns, as shown in Figure 2.1.
Figure: 2.1. Matrix sim for scoring alignment of s = AAAC and t = AGC.
We will define sim(i, j) to be the optimal alignment score (the "similarity") of
the ith prefix of s with the jth prefix of t. Thus, the matrix holds the similarity scores for
all prefixes of s and t.
In general, when we are aligning sequence s of length m and sequence t of length n, sim(m,
n) has the answer we are looking for, and we will fill out the entire matrix in order to get
this last score. Each element is determined by our sim function, which takes the following
form:
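\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i-1,\,j-1) \pm 1, & \text{align } s(i) \text{ with } t(j)\ (+1 \text{ for a match},\ -1 \text{ for a mismatch}) \\
\mathrm{sim}(i-1,\,j) - 2, & \text{align } s(i) \text{ with a gap} \\
\mathrm{sim}(i,\,j-1) - 2, & \text{align } t(j) \text{ with a gap}
\end{cases}
\]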
In order to compute sim(i, j) in our array, we need to have three entries precomputed:
sim(i-1, j-1), sim(i-1, j), and sim(i, j-1). If we compute the entries row by row, from left to
right, these values are always available when we need them.
We start by filling the 0th row and column. Using our scoring function, aligning a
single-letter string with a gap results in a score of -2; similarly, aligning a two-letter
string with gaps results in a score of -4. In general, aligning a string of i letters
with gaps gives a score of -2i, and the 0th row and column may thus be filled in
accordingly (see Figure 2.1).
Note that aligning a gap with a gap contributes nothing to the alignment of the strings,
so the score of sim(0, 0) is zero.
Now we have all the information we need to evaluate array element (1, 1): sim(1, 1) is the
score of aligning A with A, given by:
\[
\mathrm{sim}(1,1) = \max
\begin{cases}
\mathrm{sim}(0,0) + 1 = 0 + 1 = 1, & \text{align } s(1) \text{ with } t(1) \\
\mathrm{sim}(0,1) - 2 = -2 - 2 = -4, & \text{align } s(1) \text{ with a gap} \\
\mathrm{sim}(1,0) - 2 = -2 - 2 = -4, & \text{align } t(1) \text{ with a gap}
\end{cases}
\]
The maximum comes from sim(0, 0) and evaluates to 1. We place 1 in our array element
and, since the maximum came from element (0, 0), we keep track of this (by "drawing" an
arrow pointing to that array element). See Figure 2.1. We have the information required to
evaluate sim(1, 2) in the same manner:
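\[
\mathrm{sim}(1,2) = \max
\begin{cases}
\mathrm{sim}(0,1) - 1 = -2 - 1 = -3, & \text{align } s(1) \text{ with } t(2) \\
\mathrm{sim}(0,2) - 2 = -4 - 2 = -6, & \text{align } s(1) \text{ with a gap} \\
\mathrm{sim}(1,1) - 2 = 1 - 2 = -1, & \text{align } t(2) \text{ with a gap}
\end{cases}
\]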
The maximum is -1 (fill this value in the matrix) and it came from the element (1,1), so
draw an arrow pointing to the left, toward element (1 , 1).
The array element sim(1, 3) can be obtained as follows:
\[
\mathrm{sim}(1,3) = \max
\begin{cases}
\mathrm{sim}(0,2) - 1 = -4 - 1 = -5, & \text{align } s(1) \text{ with } t(3) \\
\mathrm{sim}(0,3) - 2 = -6 - 2 = -8, & \text{align } s(1) \text{ with a gap} \\
\mathrm{sim}(1,2) - 2 = -1 - 2 = -3, & \text{align } t(3) \text{ with a gap}
\end{cases}
\]
The maximum is -3 (fill this value in the matrix) and it came from the element (1,2), so
draw an arrow pointing to the left, toward element (1,2).
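Similarly, the array element sim(2, 1) is obtained as follows:
\[
\mathrm{sim}(2,1) = \max
\begin{cases}
\mathrm{sim}(1,0) + 1 = -2 + 1 = -1, & \text{align } s(2) \text{ with } t(1) \\
\mathrm{sim}(1,1) - 2 = 1 - 2 = -1, & \text{align } s(2) \text{ with a gap} \\
\mathrm{sim}(2,0) - 2 = -4 - 2 = -6, & \text{align } t(1) \text{ with a gap}
\end{cases}
\]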
The maximum is -1 (fill this value in the matrix) and it came from both elements (1,0)
and (1,1), so draw an arrow pointing upward, toward element (1,1), and an arrow pointing
diagonally, toward element (1,0).
Continuing with this process we obtain the full matrix (with arrows) depicted in
Figure 2.1. The final alignment score is in element (4, 3) and is -1.
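The table-filling procedure above can be sketched in Python as follows. This is a minimal illustration, assuming the simple scoring function of this example (match = +1, mismatch = -1, gap = -2); the function name `nw_matrix` is hypothetical and is not the MATLAB code used later in the thesis.

```python
def nw_matrix(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Fill the global alignment matrix sim row by row, left to right."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    # 0th column and 0th row: a prefix of length k aligned only with gaps.
    for i in range(1, m + 1):
        sim[i][0] = i * gap
    for j in range(1, n + 1):
        sim[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # s[i-1] is the text's s(i): Python strings are 0-indexed.
            sub = match if s[i - 1] == t[j - 1] else mismatch
            sim[i][j] = max(
                sim[i - 1][j - 1] + sub,  # align s(i) with t(j)
                sim[i - 1][j] + gap,      # align s(i) with a gap
                sim[i][j - 1] + gap,      # align t(j) with a gap
            )
    return sim


sim = nw_matrix("AAAC", "AGC")
for row in sim:
    print(row)
print("optimal score:", sim[4][3])  # -1
```

The entries sim[1][1] = 1, sim[1][2] = -1, sim[1][3] = -3 and sim[2][1] = -1 agree with the values worked out by hand above.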
Now the question is how to construct the alignment itself. This is where the arrows
come in. Start with the final array element and follow the arrows back. An arrow from (i, j)
pointing to element (i-1, j-1) (diagonally to the upper left) means to align s(i) and t(j) with
each other. An arrow pointing upward to (i-1, j) means to align s(i) with a gap, and an
arrow pointing left to (i, j-1) means to align t(j) with a gap. Following the
three possible arrow paths, we are able to build three possible alignments (with the same
score):
1)
AAAC
─AGC
total score = sum of the individual scores at each position = -2 +1 -1 +1 = -1
2)
AAAC
A─GC
total score = sum of the individual scores at each position = +1 -2 -1 +1 = -1
3)
AAAC
AG─C
total score = sum of the individual scores at each position = +1 -1 -2 +1 = -1
Each of these alignments is equally good (two matches, one mismatch, and one character
aligned with a gap). Note that we always recorded the score for the best path into each
element. There are also paths through the matrix that correspond to very bad alignments.
For example, the alignment corresponding to moving down through the first column and then
right through the last row is
AAAC───
────AGC
total score = -2 -2 -2 -2 -2 -2 -2 = -14
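The trace-back described above can be sketched as a recursive walk that follows every arrow leaving a cell. The sketch below is illustrative only: it recomputes which arrows exist from the scores instead of storing them, assumes the simple scoring function (+1, -1, -2), and uses the hypothetical name `nw_all_alignments`. The number of optimal alignments can itself grow quickly, so enumerating all of them is only sensible for short strings.

```python
def nw_all_alignments(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (optimal score, list of all optimal global alignments)."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        sim[i][0] = i * gap
    for j in range(1, n + 1):
        sim[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            sim[i][j] = max(sim[i - 1][j - 1] + sub,
                            sim[i - 1][j] + gap,
                            sim[i][j - 1] + gap)

    results = []

    def walk(i: int, j: int, row_s: str, row_t: str) -> None:
        # Rows are built back to front and reversed on reaching (0, 0).
        if i == 0 and j == 0:
            results.append((row_s[::-1], row_t[::-1]))
            return
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if sim[i][j] == sim[i - 1][j - 1] + sub:      # diagonal arrow
                walk(i - 1, j - 1, row_s + s[i - 1], row_t + t[j - 1])
        if i > 0 and sim[i][j] == sim[i - 1][j] + gap:    # upward arrow
            walk(i - 1, j, row_s + s[i - 1], row_t + "-")
        if j > 0 and sim[i][j] == sim[i][j - 1] + gap:    # leftward arrow
            walk(i, j - 1, row_s + "-", row_t + t[j - 1])

    walk(m, n, "", "")
    return sim[m][n], results


score, alignments = nw_all_alignments("AAAC", "AGC")
print(score)
for row_s, row_t in alignments:
    print(row_s, row_t)
```

For s = AAAC and t = AGC the walk yields three alignments of score -1, each pairing AAAC with a gapped copy of AGC.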
2.1.1.1. Global sequence alignment algorithm (Needleman–Wunsch Algorithm)
The difficulty in pairwise global sequence alignment arises with long sequences, because
finding the optimal alignment among all possible alignments becomes much harder. The
number of different alignments (with gaps) of two sequences of length n is
\[
\binom{2n}{n},
\]
a quantity which grows exponentially with n. This means that for two sequences of length 30
there are approximately 10^17 possible alignments, so the naïve approach of calculating the
scores of all possible alignments has an exponentially high cost. This problem can be solved
efficiently by using the Needleman–Wunsch algorithm [10]. There are other dynamic
programming algorithms for pairwise global alignment, such as Huang and Chao [11] and
NGILA [12]. The Needleman–Wunsch algorithm is guaranteed to find the optimal score for
any given symbol-scoring function in feasible time. It finds the best alignment without
enumerating all alignments because it is based on dynamic programming (DP), which is
explained in detail in section 2.1. DP methods, such as the Needleman–Wunsch (NW)
algorithm, allow us to start the computation by aligning very short DNA sequences and to
grow this alignment efficiently to the full length of the two sequences. When implemented
well, this approach has a much lower computational cost than the naïve solution. There are
three elements to DP algorithms in sequence alignment: a recursive relation, a tabular
computation, and a trace-back procedure.
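The size of the search space quoted above can be checked directly. The following Python sketch simply evaluates the binomial coefficient given in the text with the standard library function `math.comb` (available from Python 3.8); the helper name `count_alignments` is illustrative.

```python
import math


def count_alignments(n: int) -> int:
    """Number of alignments of two length-n sequences, using the binomial(2n, n) count from the text."""
    return math.comb(2 * n, n)


for n in (5, 10, 20, 30):
    print(n, count_alignments(n))

# For n = 30 the count is on the order of 10**17, while the DP table
# for two sequences of length 30 has only 31 * 31 = 961 entries.
print(f"{count_alignments(30):.2e}")
```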
The Needleman–Wunsch algorithm can be summarized as follows:
1. First we consider any two strings such as
A = a_i, i = 1, 2, …, n
B = b_j, j = 1, 2, …, m
2. An alignment matrix of size (n+1) × (m+1) is constructed, using the scoring function
\[
s(a_i, b_j) = \text{score of aligning } a_i \text{ with } b_j =
\begin{cases}
+1, & a_i = b_j \\
-\mu, & a_i \neq b_j
\end{cases}
\]
where −µ and −δ refer to the scores of a mismatch and an indel respectively.
Indel means insertion or deletion; it refers to the alignment of a gap
with a character, and the gap is represented by the symbol "−".
3. The scores for the elements in the first row and column of the alignment
matrix are given by
\[
\mathrm{sim}(i,0) = -i\delta, \qquad \mathrm{sim}(0,j) = -j\delta
\]
4. Starting from the top left, compute each entry in the alignment matrix using
the formula:
\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i-1,\,j-1) + 1 \ \text{or}\ -\mu, & \text{align } a_i \text{ with } b_j\ (+1 \text{ for a match},\ -\mu \text{ for a mismatch}) \\
\mathrm{sim}(i-1,\,j) - \delta, & \text{align } a_i \text{ with a gap} \\
\mathrm{sim}(i,\,j-1) - \delta, & \text{align } b_j \text{ with a gap}
\end{cases}
\]
where sim(i, j) is the optimal alignment score, and a_i and b_j denote the
ith and jth characters of A and B respectively.
5. Trace back element by element along the path that yielded the
maximum score into each matrix element.
The Needleman–Wunsch (NW) algorithm takes O(mn) time and O(mn) space. There
is also a space-saving version of the algorithm that takes O(m+n) space but still
runs in O(mn) time [4].
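One way to see why less space is enough: each row of the matrix depends only on the row above it, so the optimal score alone can be computed while keeping two rows in memory. The Python sketch below shows only this score-only variant, under the simple scoring function (+1, -1, -2). It is not the full space-saving algorithm referred to in [4], since recovering the alignment itself in linear space needs an additional divide-and-conquer step that is not shown here. The function name is hypothetical.

```python
def nw_score_two_rows(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score using only two rows of the matrix."""
    # The scoring function is symmetric, so the shorter string can be
    # placed along the columns to keep the stored rows short.
    if len(t) > len(s):
        s, t = t, s
    prev = [j * gap for j in range(len(t) + 1)]   # row 0
    for i in range(1, len(s) + 1):
        curr = [i * gap] + [0] * len(t)           # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub,      # diagonal
                          prev[j] + gap,          # from above
                          curr[j - 1] + gap)      # from the left
        prev = curr
    return prev[-1]


print(nw_score_two_rows("AAAC", "AGC"))  # -1
```

The running time is still proportional to the product of the two lengths, because every cell is visited once.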
2.1.2. Local sequence alignment
We know that proteins are built of modules that can be mixed and matched. Thus, the
global similarity may be poor for some sequences, but we may want to look for local
regions of similarity that suggest shared structural or functional subunits.
Definition: a local alignment of strings s and t is an alignment of a substring of s and a
substring of t.
The best alignment of substrings of s and t is called the optimal local alignment. It can
be found by removing a prefix and a suffix from each of the two sequences and
testing how well the remaining internal substrings align.
As an example, we may want to find similar substrings within the sequences s =
QUEVIVALASVEGAS and t = VIVADAVIS. This could be accomplished by computing
the best (global) alignment between all substrings of s and all substrings of t, each
substring being defined by ignoring a prefix and a suffix of the original sequence. A
possible (but not optimal) local alignment is
V I V A L A S V
| | | |   |   |
V I V A D A - V
where a prefix and a suffix have been removed from each original sequence. For clarity we
show the substrings and their prefixes/suffixes in an alignment-like representation:
Q U E V I V A L A S V E G A S
R R R                 R R R R
- - - V I V A D A - V - - I S
where "R" denotes the removed parts. Note that the local alignment itself consists only of
the substrings shown in the first representation, without their removed prefixes
and suffixes.
Local alignment appears to be harder than global alignment, since it contains many
instances of global alignment within it: we are not only optimizing over all possible
alignments, but over all possible choices of starts and ends of each substring. However, a
very clever adaptation of the Needleman–Wunsch algorithm, called the Smith–Waterman
(SW) algorithm, makes it possible to perform local alignment with the same cost as global
alignment. The keys to local alignment are to use a slightly more complex scoring
function, and to use a different method for reading the desired alignment from the table.
2.1.2.1 Local sequence alignment algorithm (Smith–Waterman Algorithm)
The local alignment algorithm (the Smith–Waterman algorithm) [13] involves a very
simple modification to the global alignment algorithm. We again use an (m+1)x(n+1)
matrix sim. However, this time the (i, j)th entry of sim holds the score of the optimal
alignment between a suffix of the ith prefix of s and a suffix of the jth prefix of t.
The basic intuition is that a suffix of a prefix is a substring, so sim(i, j) holds the best
alignment score between substrings that end at s(i) and t(j). Suppose that we want the best
alignment between a substring of s ending at s(i) and a substring of t ending at t(j). Since
the endpoints are fixed, we need to pick the best starting points for the substrings.
Note that s(i) is not necessarily aligned with t(j).
A small addition to the global alignment algorithm solves this problem when filling in the
sim matrix. First we set the first row and column of sim to zero, because the best local
alignment has no initial gaps.
Then we use the rule:
\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i-1,\,j-1) \pm 1 \\
\mathrm{sim}(i-1,\,j) - 2 \\
\mathrm{sim}(i,\,j-1) - 2 \\
0
\end{cases}
\]
where the sign in the first case is chosen as before, according to whether s(i) matches t(j)
or not. The zero in the fourth case is used whenever the best of the other three scores
would be negative.
From the above, local alignment differs from global alignment in the following ways:
1. In local alignment the trace-back starts at the largest score in the scoring matrix
and stops when it reaches a zero, while in global alignment it starts from the
lower right cell and ends at the upper left cell.
2. In local alignment any entry of the scoring matrix that would be negative is
set to zero, whereas in global alignment entries may be negative (i.e. less than
zero).
See figure 2.2 for an example of the algorithm at work, in a local alignment of strings s =
AGCT and t = GCA.
Figure: 2.2. Local Alignment Example
To construct an optimal local alignment, we look for the maximum value in the matrix.
Then we follow its arrows back until we hit a zero. In Figure 2.2 we choose the value 2 in
entry sim(3,2). This gives the local alignment
s: GC
t: GC
If we want further optimal alignments, we take another maximum value in the matrix,
generally one that has not already been visited by a previous path.
If we want the second best local alignment, we take the second highest value in the matrix
that has not already been visited. These alignments are sometimes useful and are called
near-optimal alignments. In our case, if we start from sim(1,3), we obtain the local
alignment
s: A
t: A
This algorithm uses O(mn) time and O(mn) space. Again, it is possible to improve it to use
O(m+n) space [4].
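The local alignment procedure can be sketched in Python as follows. The sketch assumes the simple scoring function (+1, -1, -2), returns only the first best-scoring local alignment it finds, and uses the hypothetical name `smith_waterman`; it illustrates the rule above and is not code from the thesis.

```python
def smith_waterman(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of s, aligned part of t, matrix)."""
    m, n = len(s), len(t)
    # First row and column stay at zero: no initial gaps in a local alignment.
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            sim[i][j] = max(0,
                            sim[i - 1][j - 1] + sub,
                            sim[i - 1][j] + gap,
                            sim[i][j - 1] + gap)
            if sim[i][j] > best:
                best, best_pos = sim[i][j], (i, j)

    # Trace back from the largest entry until a zero is reached.
    i, j = best_pos
    row_s, row_t = [], []
    while i > 0 and j > 0 and sim[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if sim[i][j] == sim[i - 1][j - 1] + sub:
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif sim[i][j] == sim[i - 1][j] + gap:
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(row_s)), "".join(reversed(row_t)), sim


score, part_s, part_t, sim = smith_waterman("AGCT", "GCA")
print(score, part_s, part_t)  # 2 GC GC
```

For s = AGCT and t = GCA the largest entry is sim[3][2] = 2, and sim[1][3] = 1 is the entry from which the near-optimal alignment of A with A starts.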
2.2 Gene Tracer algorithm
In this section, we present a proposed algorithm called Gene Tracer [14], which is used to
find relations between sequences in order to determine their homology (genes or characters
B and C that evolved from the same ancestral gene or character A are said to be homologs [6]).
This algorithm is used in applications such as determining the originality or functionality of
sequences. In other words, Gene Tracer traces gene modifications from ancestor
sequences through to an offspring sequence: it tracks down gene modifications in the ancestor
sequences and finds the related parts of each ancestor sequence in the offspring.
Gene Tracer can precisely locate the contribution of each ancestor sequence inside
the offspring sequence, and it gives statistical results that express the relationship between
the two ancestor sequences and their offspring, as shown in Figure 2.3.
Figure: 2.3. Gene Tracer Function
Gene Tracer’s inputs are two ancestor sequences and one offspring sequence. As shown in
Figure 2.3, its output is the three sequences with colored parts that represent the matching
parts. The longest matching parts common to ancestor 1 and the offspring sequence are
colored red; if their length is L and the total length of ancestor 1 is Z1, then the matching
percentage is L / Z1. The same is done for ancestor 2 and the offspring, but in blue, and
the percentage is K / Z2.
The Gene Tracer algorithm is a modification of the Smith–Waterman (SW) local alignment
algorithm. The modifications are as follows:
1- Determining the locations of common substrings in both the ancestor and
offspring sequences.
2- Distinguishing these common substrings by giving them clear and
different colors.
Gene Tracer has complexity O(max(M,N)*P) in both computing time and memory space,
where M and N are the lengths of the two ancestor sequences and P is the length
of the offspring sequence.
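The matching percentage L / Z1 can be illustrated with a small Python sketch. It is not the Gene Tracer implementation of [14]: it assumes that the "longest matching part" is the longest exact common substring of an ancestor and the offspring, and the function names and the toy sequences are hypothetical.

```python
def longest_common_substring(a: str, b: str):
    """Return (length, start in a, start in b) of the longest exact common substring."""
    best_len, end_a, end_b = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at (i-1, j-1).
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, end_a, end_b = curr[j], i, j
        prev = curr
    return best_len, end_a - best_len, end_b - best_len


def matching_percentage(ancestor: str, offspring: str) -> float:
    """L / Z as a percentage: longest match length over the ancestor length."""
    length, _, _ = longest_common_substring(ancestor, offspring)
    return 100.0 * length / len(ancestor)


# Toy sequences, invented for illustration only.
ancestor1 = "ACGTTGCA"
offspring = "TTGCAGGG"
print(longest_common_substring(ancestor1, offspring))  # (5, 3, 0)
print(matching_percentage(ancestor1, offspring))       # 62.5
```

The start positions returned by the first function correspond to the locations that would be colored in each sequence.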
2.3 Multiple Sequence Alignment algorithms
In the previous sections we considered alignments between two sequences. Here, we consider
the case where we wish to align three or more entire sequences (i.e., global multiple
sequence alignment).
Definition: Given k strings S1, S2, …, Sk, a multiple sequence alignment (MSA) is
obtained by inserting gaps in the strings to make them all the same length.
For example, the following is an MSA of the 4 sequences MQPILLLV, MLRLL, MKILLL, and
MPPVLILV.
M Q P I L L L V
M L R - L L - -
M K - I L L L -
M P P V L I L V
Note that no column may be all gaps.
Multiple sequence alignments are used for many reasons, including:
1. To detect regions of variability or conservation in a family of proteins (that
might be missed in a pairwise alignment).
2. To provide stronger evidence than pairwise similarity for structural and
functional inferences.
3. To serve as the first step in other computational procedures such as:
- phylogenetic reconstruction
- RNA secondary structure prediction
- building profiles (probabilistic models) for protein families or
DNA signals; these profiles can be used in locating other similar
sequences.
For pairwise alignments, we scored each column by looking at matches, mismatches, and
gaps in the two sequences (in practice protein sequences are scored using substitution
matrices).
However, now we have multiple characters in each column, and it is not obvious what the
best way to score a column is. There are many possibilities. The sum-of-pairs (SP) score is a
common scoring scheme. Here, each column in an alignment is scored by summing the
scores of all pairs of symbols in that column. The score of the entire alignment is then
the sum of all column scores. We will assume that a match = 1, a mismatch = -1, and a
gap = -2.
For example, the sum-of-pairs score of the 4th column of the MSA given earlier is:
SP(I,-,I,V) = score(I,-) + score(I,I) + score(I,V) + score(-,I) + score(-,V) + score(I,V)
= -2 +1 -1 -2 -2 -1 = -7
Although there is never an entire column of gaps, if we look at any 2 sequences in the
alignment, there may be columns where both have gaps. In this case the value of score(-,-)
is set equal to zero.
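The sum-of-pairs score can be written compactly in Python. The sketch below assumes the scores stated above (match = 1, mismatch = -1, gap = -2, and score(-,-) = 0) and uses illustrative function names.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of symbols taken from the same column."""
    if a == "-" and b == "-":
        return 0          # gap against gap contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column) -> int:
    """Sum of the scores of all unordered pairs of symbols in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows) -> int:
    """Sum-of-pairs score of a whole alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column("I-IV"))  # -7
```

A column of k symbols contributes k(k-1)/2 pair scores, so a column with four identical symbols scores 6.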
A very popular and efficient heuristic algorithm for multiple alignment is CLUSTAL,
originally developed by Desmond Higgins and Paul Sharp at Trinity College, Dublin in
1988 and extended by Higgins, Julie Thompson, and Toby Gibson into the current version,
CLUSTALW [15].
How does CLUSTALW work?
CLUSTALW is a general-purpose multiple sequence alignment program for
DNA or proteins, and it works as follows:
1. The sequences are aligned pairwise using the Needleman–Wunsch algorithm
(global alignment).
2. The alignment scores between all pairs of sequences are computed.
3. A guide tree that reflects the similarities between sequences is built, using the
pairwise alignment distances.
4. The sequences are aligned following the guide tree. Corresponding to each node in
the tree, the algorithm aligns the sequences or alignments that are associated
with its two daughter nodes. The process is repeated beginning from the
leaves (sequences) and ending with the tree root.
CHAPTER THREE
USING SEQUENCE ALIGNMENT ALGORITHMS
TO DEMONSTRATE THE GENETIC EVOLUTION OF
HEAT SHOCK FACTOR1 (HSF1) IN DIFFERENT
EUKARYOTIC ORGANISMS
3.1. Overview
Tracing a specific gene, or the whole genome, of a species is very
important for biologists. Sometimes the aim of gene tracing is to understand the genetic
evolution of a certain protein sequence in specific organisms, and also the genetic
evolution of a conserved domain within this protein sequence. This
process is very difficult for biologists, but computer algorithms make the
problem easier. In this chapter we present a comprehensive study that uses the
alignment algorithms explained in detail in chapter two to
demonstrate the genetic evolution of heat shock factor protein 1 (HSF1) in
some eukaryotic organisms: human, Danio rerio, Bos taurus, mouse, plant
(Arabidopsis) and yeast. In addition, this study illustrates the molecular evolution of
the conserved domain of HSF1 (HSF_DNA-bind) across these eukaryotic
organisms.
3.2. Introduction
The focus of much recent research on biological evolution has been the detection of
similarities and dissimilarities among living species. As the study of nature continues,
human knowledge of the variation among species has grown gradually in all directions.
This knowledge is used to give names to the species we know. Identifying, naming, and
organizing species into groups is a science called taxonomy. Up till now, millions of
species have been identified [16], demonstrating the vast current knowledge about species.
Scientists go further and think about the cause and evolution of the observed similarities and
dissimilarities between different organisms. Many biologists, including Charles Darwin,
have suggested that all living organisms share a common ancestor.
Meanwhile, the differences among species are partly caused by mutations accumulated
over generations during the course of evolution.
Thus, phylogenetics is the science that studies the evolutionary relatedness among species.
Researchers have established the links among seemingly different life forms, from
bacteria to animals and plants [17].
Pairwise global alignment algorithms are intended for comparing two sequences that are
similar over their entire length. A dynamic programming (DP) algorithm by Needleman and
Wunsch [18] was proposed for pairwise global alignment. Such methods are very useful in
the analysis of DNA and protein sequences. There are other dynamic programming algorithms
for pairwise global alignment, such as Huang and Chao [11] and NGILA [12]. There are
also algorithms for pairwise local alignment, such as the algorithm introduced
by Smith and Waterman [13]. Another algorithm, introduced by Thompson et al. [15], is
used to make multiple sequence alignments (MSA). It depends on
progressive alignment, which works by constructing a succession of pairwise alignments.
Initially, two sequences are chosen and aligned by standard pairwise alignment; this
alignment is fixed. Then, a third sequence is chosen and aligned to the first alignment, and
this process is iterated until all sequences have been aligned. Progressive alignment is
heuristic: it does not separate the process of scoring an alignment from the optimization
algorithm, and it does not directly optimize any global scoring function of alignment
correctness. The advantage of progressive alignment is that it is fast and efficient, and in
many cases the resulting alignments are reasonable. General results are given in the
conclusion.
3.3. Heat shock factor protein 1 (HSF1)
In biochemistry, heat shock (HS) is the effect of subjecting a cell to a temperature higher
than the ideal body temperature of the organism from which the cell line was derived.
Heat shock factor (HSF), in molecular biology, is the name given to transcription factors
that regulate the expression of the heat shock proteins. HSF1 is a member of the heat
shock transcription factor family and is considered the major regulator of heat shock
protein transcription in eukaryotes. Protein-damaging stress leads to the activation of HSF1,
which binds to upstream regulatory sequences in the promoters of heat shock genes,
leading to enhanced heat shock gene expression.
Figure: 3.1. Structure of the heat shock factor protein 1 DNA-binding domain from Homo
sapiens.
3.4. Biological data used in our work
All biological data used here were obtained from the National Center for
Biotechnology Information (NCBI) and are written in FASTA format. FASTA
is one of the standard formats for writing sequences, and it is a general format used
across sequence databases.
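A FASTA record consists of a header line starting with ">" followed by one or more lines of sequence. As a minimal sketch of how such data can be read, the following Python function collects the records of a FASTA text into a dictionary. The function name `parse_fasta` and the two toy records are hypothetical and only illustrate the format; they are not part of the data or code used in this work.

```python
def parse_fasta(text: str) -> dict:
    """Map each header (without the leading '>') to its full sequence."""
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped in the file and joined here.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Toy input invented for illustration.
example = """>seq1 toy protein
MDLPVG
PGAAG
>seq2 toy protein
MEYHSV
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```

Joining the wrapped lines matters because the alignment algorithms of chapter two expect each sequence as a single string.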
In our work we use the following data:
1. Data represent the entire sequence of heat shock factor protein1 of
[homo sapiens] in FASTA format
>gi|5031767|ref|NP_005517.1| heat shock factor protein 1 [Homo sapiens]
MDLPVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSF
HVFDQGQFAKEVLPKYFKHNNMASFVRQLNMYGFRKVVHIEQ
GGLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLKSEDIKI
RQDSVTKLLTDVQLMKGKQECMDSKLLAMKHENEALWREVAS
LRQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIPLMLNDSGSA
HSMPKYSRQFSLEHVHGSGPYSAPSPAYSSSSLYAPDAVASSGPII
SDITELAPASPMASPGGSIDERPLSSSPLVRVKEEPPSPPQSPRVEE
ASPGRPSSVDTLLSPTALIDSILRESEPAPASVTALTDARGHTDTE
GRPPSPPPTSTPEKCLSVACLDKNELSDHLDAMDSNLDNLQTML
SSHGFSVDTSALLDLFSPSVTVPDMSLPDLDSSLASIQELLSPQEPP
RPPEAENSSPDSGKQLVHYTAQPLFLLDPGSVDTGSNDLPVLFEL
GEGSYFSEGDGFAEDPTISLLTGSEPPKAKDPTVS
2. Data represent the conserved domain (HSF_DNA-bind) of heat
shock factor protein1 sequence of [homo sapiens] in FASTA format
NVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVFDQGQFAKEVL
PKYFKHNNMASFVRQLNMYGFRKVVHIEQGGLVKPERDDTEFQ
HPCFLRGQEQLLENIKRK
3. Data represent the entire sequence of heat shock factor protein1 of
[Danio rerio] in FASTA format
>gi|18858865|ref|NP_571675.1| heat shock factor protein 1 [Danio rerio]
MEYHSVGPGGVVVTGNNVPAFLTKLWTLVEDPDTDPLICWSPN
GTSFHVFDQGRFSKEVLPKYFKHNNMASFVRQLNMYGFRKVVH
IEQGGLVKPEKDDTEFQHPYFIRGQEQLLENIKRKVTTVSNIKHE
DYKFSTDDVSKMISDVQHMKGKQESMDSKISTLKHENEMLWRE
VATLRQKHSQQQKVVNKLIQFLITLARSNRVLGVKRKMPLMLN
DSSSAHSMPKFSRQYSLESPAPSSTAFTGTGVFSSESPVKTGPIISD
ITELAQSSPVATDEWIEDRTSPLVHIKEEPSSPAHSPEVEEVCPVE
VEVGAGSDLPVDTPLSPTTFINSILQESEPVFRPDSAPSEQKCLSV
ACLDNYPQMSEITRLFSGFSTSSLHLRPHSGTELHDHLESIDSGLE
NLQQILNAQSINFDSSPLFDIFSSAASDVDLDSLASIQDLLSPDPVK
ETESGVDTDSGKQLVQYTSQPSFSPIPFSTDSSSTDLPMLLELQDD
SYFSSEPTEDPTIALLNFQPVPEDPSRTRIGDPCFKLKKESKR
4. Data represent the conserved domain (HSF_DNA-bind) of heat
shock factor protein1 sequence of [Danio rerio] in FASTA format
AFLTKLWTLVEDPDTDPLICWSPNGTSFHVFDQGRFSKEVLPKYF
KHNNMASFVRQLNMYGFRKVVHIEQGGLVKPEKDDTEFQHPYF
IRGQEQLLENIKRK
5. Data represent the entire sequence of heat shock factor protein1 of
[Bos taurus] in FASTA format
>gi|116003843|ref|NP_001070277.1| heat shock factor protein 1 [Bos taurus]
MDLPVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSF
HVLDQGQFAKEVLPKYFKHSNMASFVRQLNMYGFRKVVHIEQG
GLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLRSEDIKIR
QDSVTKLLTDVQLMKGKQESMDSKLLAMKHENEALWREVASL
RQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIPLMLNDGGPA
HPMPKYGRQYSLEHIHGPGPYPAPSPAYSGSSLYSPDAVTSSGPII
SDITELAPGSPVASSGGSVDERPLSSSPLVRVKEEPPSPPQSPRAEG
ASPGRPSSMVETPLSPTTLIDSILRESEPTPVASTTPLVDTGGRPPS
PLPASAPEKCLSVACLDKTELSDHLDAMDSNLDNLQTMLTSHGF
SVDTSTLLDLFSPSVTVPDMSLPDLDSSLASIQELLSPQEPPRPLEA
EKSSPDSGKQLVHYTAQPLLLLDPGSVDVGSSDLPVLFELGEGSY
FSEGDDYSDDPTISLLTGSEPPKAKDPTVS
6. Data represent the conserved domain (HSF_DNA-bind) of heat
shock factor protein1 sequence of [Bos taurus] in FASTA format
NVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVLDQGQFAKEVL
PKYFKHSNMASFVRQLNMYGFRKVVHIEQGGLVKPERDDTEFQ
HPCFLRGQEQLLENIKRK
7. Data represent the entire sequence of heat shock factor protein1 of
[Mus musculus] in FASTA format
>gi|62740231|gb|AAH94064.1| Hsf1 protein [Mus musculus]
MDLAVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSF
HVFDQGQFAKEVLPKYFKHNNMASFVRQLNMYGFRKVVHIEQ
GGLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLKSEDIKI
RQDSVTRLLTDVQLMKGKQECMDSKLLAMKHENEALWREVAS
LRQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIPLMLSDSNSA
HSVPKYGRQYSLEHVHGPGPYSAPSPAYSSSSLYSSDAVTSSGPII
SDITELAPTSPLASPGRSIDERPLSSSTLVRVKQEPPSPPHSPRVLE
ASPGRPSSMDTPLSPTAFIDSILRESEPTPAASNTAPMDTTGAQAP
ALPTPSTPEKCLSVACLDNLARAPQMSGVARLFPCPSSFLHGRVQ
PGNELSDHLDAMDSNLDNLQTMLTSHGFSVDTSALLDIQELLSP
QEPPRPIEAENSNPDSAGALHGSASVPAGS
8. Data represent the conserved domain (HSF_DNA-bind) of heat
shock factor protein1 sequence of [Mus musculus] in FASTA format
NVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVFDQGQFAKEVL
PKYFKHNNMASFVRQLNMYGFRKVVHIEQGGLVKPERDDTEFQ
HPCFLRGQEQLLENIKRK
9. Data represent the entire sequence of heat shock factor protein1 of
yeast [Saccharomyces cerevisiae] in FASTA format
>gi|1322586|emb|CAA96777.1| HSF1 [Saccharomyces cerevisiae]
MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR
TDASTTSSTAIEDIINPSLDPQSAASPVPSSSFFHDSRKPSTSTHLVR
RGTPLGIYQTNLYGHNSRENTNPNSTLLSSKLLAHPPVPYGQNPD
LLQHAVYRAQPSSGTTNAQPRQTTRRYQSHKSRPAFVNKLWSM
LNDDSNTKLIQWAEDGKSFIVTNREEFVHQILPKYFKHSNFASFV
RQLNMYGWHKVQDVKSGSIQSSSDDKWQFENENFIRGREDLLE
KIIRQKGSSNNHNSPSGNGNPANGSNIPLDNAAGSNNSNNNISSS
NSFFNNGHLLQGKTLRLMNEANLGDKNDVTAILGELEQIKYNQI
AISKDLLRINKDNELLWQENMMARERHRTQQQALEKMFRFLTSI
VPHLDPKMIMDGLGDPKVNNEKLNSANNIGLNRDNTGTIDELKS
NDSFINDDRNSFTNATTNARNNMSPNNDDNSIDTASTNTTNRKK
NIDENIKNNNDIINDIIFNTNLANNLSNYNSNNNAGSPIRPYKQRY
LLKNRANSSTSSENPSLTPFDIESNNDRKISEIPFDDEEEEETDFRP
FTSRDPNNQTSENTFDPNRFTMLSDDDLKKDSHTNDNKHNESDL
FWDNVHRNIDEQDARLQNLENMVHILSPGYPNKSFNNKTSSTNT
NSNMESAVNVNSPGFNLQDYLTGESNSPNSVHSVPSNGSGSTPLP
MPNDNDTEHASTSVNQGENGSGLTPFLTVDDHTLNDNNTSEGST
RVSPDIKFSATENTKVSDNLPSFNDHSYSTQADTAPENAKKRFVE
EIPEPAIVEIQDPTEYNDHRLPKRAKK
10. Data represent the conserved domain (HSF_DNA-bind) of heat
shock factor protein1 sequence of yeast [Saccharomyces cerevisiae]
in FASTA format
SRPAFVNKLWSMLNDDSNTKLIQWAEDGKSFIVTNREEFVHQIL
PKYFKHSNFASFVRQLNMYGWHKVQDVKSGSIQSSSDDKWQFE
NENFIRGREDLLEKIIRQ
11. Data represent the entire sequence of heat shock factor protein1 of
plant [Arabidopsis thaliana] in FASTA format
>gi|7268528|emb|CAB78778.1| heat shock transcription factor HSF1 [Arabidopsis
thaliana]
MFVNFKYFSFFIRTKMDGVTGGGTNIGEAVTAPPPRNPHPATLLN
ANSLPPPFLSKTYDMVEDPATDAIVSWSPTNNSFIVWDPPEFSRD
LLPKYFKHNNFSSFVRQLNTYGFRKVDPDRWEFANEGFLRGQK
HLLKKISRRKSVQGHGSSSSNPQSQQLSQGQGSMAALSSCVEVG
KFGLEEEVEQLKRDKNVLMQELVKLRQQQQTTDNKLQVLVKH
LQVMEQRQQQIMSFLAKAVQNPTFLSQFIQKQTDSNMHVTEAN
KKRRLREDSTAATESNSHSHSLEASDGQIVKYQPLRNDSMMWN
MMKTDDKYPFLDGFSSPNQVSGVTLQEVLPITSGQSQAYASVPS
GQPLSYLPSTSTSLPDTIMPETSQIPQLTRESINDFPTENFMDTEKN
VPEAFISPSPFLDGGSVPIQLEGIPEDPEIDELMSNFEFLEEYMPESP
VFGDATTLENNNNNNNNNNNNNNNNNNNNTNGRHMDKLIEEL
GLLTSETEH
12. Data represent the conserved domain (HSF_DNA-bind) of heat
shock factor protein1 sequence of plant [Arabidopsis thaliana] in
FASTA format
PFLSKTYDMVEDPATDAIVSWSPTNNSFIVWDPPEFSRDLLPKYF
KHNNFSSFVRQLNTYGFRKVDPDRWEFANEGFLRGQKHLLKKIS
RRKS
3.5. Results and Discussion
3.5.1. Pairwise global alignment among the common conserved
domains (HSF_DNA-bind) and also among the entire sequence
of HSF1 protein
First, we wrote MATLAB code to perform pairwise global alignment among the common
conserved domains (HSF_DNA-bind), which are part of the heat shock factor protein 1
(HSF1) sequence, in all the eukaryotic organisms that we used. We also
performed pairwise global alignment among the entire sequences of the HSF1 protein.
The results were as follows:
1) The results obtained by using the sequence dot plot matrix for
the conserved domains (HSF_DNA-bind) in HSF1 protein
sequences
[Dot plot panels, with residue positions (ticks from 10 to 100) of the two conserved-domain sequences on the axes: plant & yeast; yeast & taurus; taurus & Danio rerio; taurus & plant; yeast & mouse; taurus & human; mouse & plant; taurus & mouse; human & mouse; mouse & Danio rerio; yeast & human; plant & human; yeast & Danio rerio; Danio rerio & human; Danio rerio & plant]
Figure: 3.2. Results of sequence dot plot matrix among the conserved domains
(HSF_DNA-bind) in HSF1 protein sequences
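The construction behind a dot plot such as those in Figure 3.2 can be sketched in a few lines. The following Python sketch is an illustration only and is not the MATLAB code used in this work; the function names `dot_plot` and `render`, the `window` and `threshold` parameters, and the second test sequence are assumptions made for the example. A cell is marked when the residues (or a window of residues) at the two positions match, so similar sequences produce a visible main diagonal.

```python
def dot_plot(seq_a, seq_b, window=1, threshold=1):
    """Return a 0/1 matrix: cell (i, j) is 1 when the windows starting at
    seq_a[i] and seq_b[j] agree in at least `threshold` positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(1 for k in range(window)
                          if seq_a[i + k] == seq_b[j + k])
            row.append(1 if matches >= threshold else 0)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the matrix as text: '*' for a match, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


# First ten residues of the Arabidopsis HSF_DNA-bind domain listed above,
# compared with a hypothetical variant that differs at two positions.
seq_a = "PFLSKTYDMV"
seq_b = "PFLTKTYEMV"  # hypothetical sequence, for illustration only
plot = dot_plot(seq_a, seq_b)
print(render(plot))
```

With `window=1` every single-residue match is marked, which is noisy for long proteins; a larger window with a threshold below the window size is a common way to suppress isolated matches.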
2) Results obtained using the Needleman-Wunsch algorithm for the conserved domains (HSF_DNA-bind) in HSF1 protein sequences
In our work we use the following scoring matrices:
1- Blosum50
2- Blosum30
Conserved domains (HSF_DNA-bind)     Score (Blosum50)   Identities   Score (Blosum30)   Identities
Human & Danio rerio                  210.667            89%          135.2              89%
Human & Yeast                        114.333            46%          74                 46%
Human & Plant (Arabidopsis)          91.6667            51%          60.8               51%
Human & Mouse                        244.667            100%         156.8              100%
Human & Taurus                       240.333            98%          153.6              98%
Danio rerio & Mouse                  210.667            89%          135.2              89%
Danio rerio & Taurus                 206.333            87%          132                87%
Danio rerio & Plant (Arabidopsis)    92                 51%          61.2               51%
Danio rerio & Yeast                  103.667            46%          66.6               46%
Mouse & Taurus                       240.333            98%          153.6              98%
Mouse & Yeast                        114.333            46%          74                 46%
Mouse & Plant (Arabidopsis)          91.667             51%          60.8               51%
Taurus & Yeast                       116                47%          75.2               47%
Taurus & Plant (Arabidopsis)         88.6667            50%          58.6               50%
Yeast & Plant (Arabidopsis)          71.6667            43%          51                 43%
Table 3.1. Scores and percentages of similarity (identities) of the pairwise global alignments among the conserved domains (HSF_DNA-bind) in the HSF1 protein sequences of all the organisms above.
The results in Table 3.1 show that the highest similarity is between human and mouse (100%), whereas the lowest is between yeast and plant (Arabidopsis) (43%). These results agree with those obtained in Figure 3.2.
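The global alignment step can be illustrated with a compact Needleman-Wunsch implementation. This is a minimal Python sketch and not the MATLAB code used in this work: it assumes a simple match/mismatch/linear-gap scoring (the values `match=1`, `mismatch=-1`, `gap=-2` are hypothetical) instead of the Blosum50 and Blosum30 matrices, so its scores are not comparable with those in Table 3.1. The identity percentage is assumed here to be the number of identical aligned residues divided by the alignment length.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align two residues
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical aligned residues as a percentage of the alignment length."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


score, al_a, al_b = needleman_wunsch("ACGT", "AGT")
print(score)
print(al_a)
print(al_b)
print(percent_identity(al_a, al_b))
```

Replacing the match/mismatch rule by a lookup in a substitution matrix such as Blosum50 changes only the line that computes `s`; the recurrence and the traceback stay the same.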
3) Results obtained using the sequence dot plot matrix for the entire sequences of HSF1 protein
[Figure 3.3 image: fifteen dot plot panels, one per pair of organisms; axes are labelled with the organism names, with ticks from 50 to 500 (0 to 800 for yeast). Panel titles as printed: Taurus & yeast; yeast & plant; Taurus & human; yeast & mouse; human & mouse; plant & mouse; human & plant; Taurus & mouse; yeast & Danio rerio; Taurus & Danio rerio; Taurus & human; yeast & human; Danio rerio & mouse; Danio rerio & plant; human & Danio rerio.]
Figure: 3.3. Results of sequence dot plot matrix among the entire sequences of
HSF1 protein.
4) Results obtained using the Needleman-Wunsch algorithm for the entire sequence of HSF1 protein
HSF1 protein sequence                Score (Blosum50)   Identities   Score (Blosum30)   Identities
Human & Danio rerio                  544.333            56%          345.8              55%
Human & Yeast                        440.333            22%          239.4              21%
Human & Plant (Arabidopsis)          -8                 27%          23.8               24%
Human & Mouse                        649.667            70%          406.6              70%
Human & Taurus                       1017.33            89%          624.6              89%
Danio rerio & Mouse                  415.333            52%          269.2              51%
Danio rerio & Taurus                 565.333            56%          355.8              56%
Danio rerio & Plant (Arabidopsis)    -7.6666            26%          23.8               25%
Danio rerio & Yeast                  -397.667           22%          -212.4             20%
Mouse & Taurus                       670                70%          418.6              70%
Mouse & Yeast                        -547               22%          -306.6             23%
Mouse & Plant (Arabidopsis)          34.3333            27%          52.8               25%
Taurus & Yeast                       -427               23%          -233.2             23%
Taurus & Plant (Arabidopsis)         -5                 27%          24.6               24%
Yeast & Plant (Arabidopsis)          -450.333           24%          -237               23%
Table 3.2. Scores and percentages of similarity (identities) of the pairwise global alignments among the entire HSF1 protein sequences of all the species above.
The results in Table 3.2 show that the highest similarity is between human and Taurus (89%), whereas the lowest (22% with Blosum50) is between human and yeast, Danio rerio and yeast, and mouse and yeast. These results agree with those obtained in Figure 3.3.
3.5.2. Multiple sequence alignment (MSA) among all the conserved domains (HSF_DNA-bind) in HSF1 protein sequences and among all the entire sequences of HSF1 protein
Second, we wrote MATLAB code to create a multiple sequence alignment (MSA) among all the conserved domains (HSF_DNA-bind) in the HSF1 protein sequences of all the organisms above. We also built an MSA among the entire HSF1 protein sequences of those organisms. MSA is used to confirm and refine the results obtained in Table 3.1 and Table 3.2. The results are illustrated using phylogenetic trees.
In our work we use the following scoring matrices:
1- Blosum60
2- Blosum80
3- PAM10
The results were as follows:
1) Comparative phylogenetic study of the conserved domains (HSF_DNA-bind) in HSF1 protein sequences:
Figure: 3.4. Graphical summary of the conserved domains of HSF1 protein in the different
organisms human, Danio rerio, Taurus, mouse, yeast, and plant.
a) Matrix blosum60
A dendrogram was constructed from the HSF1 domain (HSF_DNA-bind) of human, mouse, Taurus, Danio rerio, plant (Arabidopsis), and yeast using the scoring matrix blosum60. With this matrix, human and mouse are the closest in the sequence of the HSF1 domain (HSF_DNA-bind). The tree also shows that the plant (Arabidopsis) and yeast HSF1 domains (HSF_DNA-bind) are closer to each other than to those of the other organisms.
[Figure 3.5 image: phylogenetic tree titled "MSA among the conserved domain by using the scoring matrix blosum60"; leaves: human, mouse, taurus, danio rerio, yeast, arabidopsis; internal nodes: Branch 1 to Branch 4 and Root; horizontal axis from 0 to 0.6.]
Figure: 3.5. Similarity between the conserved domains in the HSF1 protein sequences of all the species above, from MSA using the scoring matrix blosum60.
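How a dendrogram groups the organisms can be illustrated from the pairwise results alone. The following Python sketch is a stand-in and not the MATLAB procedure used in this work, which builds its trees from the multiple sequence alignment: it assumes a distance of 1 minus the identity fraction, taken from the Blosum50 column of Table 3.1, and applies plain average-linkage (UPGMA) clustering. The function name `upgma` and the choice of distance are assumptions; branch lengths and the position of the root therefore need not match the figures.

```python
from itertools import combinations

# Identity percentages (Blosum50 column) copied from Table 3.1.
IDENTITY = {
    ("human", "danio rerio"): 89, ("human", "yeast"): 46,
    ("human", "plant"): 51, ("human", "mouse"): 100, ("human", "taurus"): 98,
    ("danio rerio", "mouse"): 89, ("danio rerio", "taurus"): 87,
    ("danio rerio", "plant"): 51, ("danio rerio", "yeast"): 46,
    ("mouse", "taurus"): 98, ("mouse", "yeast"): 46, ("mouse", "plant"): 51,
    ("taurus", "yeast"): 47, ("taurus", "plant"): 50, ("yeast", "plant"): 43,
}
# Assumed distance measure: 1 - identity fraction.
DIST = {frozenset(pair): 1.0 - pct / 100.0 for pair, pct in IDENTITY.items()}


def upgma(names, dist):
    """Average-linkage clustering; returns the merges in the order made."""
    clusters = {name: [name] for name in names}  # cluster label -> its leaves

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        # Pick the two clusters with the smallest average leaf-to-leaf distance.
        c1, c2 = min(combinations(list(clusters), 2),
                     key=lambda pair: average_distance(*pair))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters[(c1, c2)] = clusters.pop(c1) + clusters.pop(c2)
    return merges


NAMES = ["human", "mouse", "taurus", "danio rerio", "plant", "yeast"]
merges = upgma(NAMES, DIST)
for left, right, height in merges:
    print(left, "+", right, "at distance", round(height, 4))
```

Under these assumptions human and mouse are joined first, followed by Taurus and then Danio rerio, which leaves Arabidopsis and yeast as the two most distant leaves.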
b) Matrix blosum80
When we use the scoring matrix blosum80 the same results are obtained as when
using blosum60.
[Figure 3.6 image: phylogenetic tree titled "MSA among the conserved domains by using the scoring matrix blosum 80"; leaves: human, mouse, taurus, danio rerio, yeast, arabidopsis; internal nodes: Branch 1 to Branch 4 and Root; horizontal axis from 0 to 0.6.]
Figure: 3.6. Similarity between the conserved domains in the HSF1 protein sequences of all the species above, from MSA using the scoring matrix blosum80.
c) Matrix PAM10
When we use the scoring matrix PAM10 the same results are obtained as when using
blosum60 and blosum80.
[Figure 3.7 image: phylogenetic tree titled "MSA among the conserved domain by using the scoring matrix pam10"; leaves: human, mouse, taurus, danio rerio, yeast, arabidopsis; internal nodes: Branch 1 to Branch 4 and Root; horizontal axis from 0 to 0.6.]
Figure: 3.7. Similarity between the conserved domains in the HSF1 protein sequences of all the species above, from MSA using the scoring matrix PAM10.
2) Comparative phylogenetic study of the entire length of HSF1 protein sequences, shown in the following phylogenetic trees:
a) Matrix blosum60
A dendrogram was constructed from the entire length of the HSF1 protein sequences of human, mouse, Taurus, Danio rerio, plant (Arabidopsis), and yeast using the scoring matrix blosum60. With this matrix, human and Taurus are the closest over the entire length of the HSF1 protein sequences. The tree also shows that the plant (Arabidopsis) and yeast are much closer to each other than to the other organisms.
[Figure 3.8 image: phylogenetic tree titled "MSA among the entire sequences of HSF1 by using scoring matrix blosum60"; leaves: human, taurus, mouse, danio rerio, yeast, arabidopsis; internal nodes: Branch 1 to Branch 4 and Root; horizontal axis from 0 to 1.8.]
Figure: 3.8. Similarity between the entire HSF1 protein sequences of all the species above, using the scoring matrix blosum60.
b) Matrix blosum80
When we use the scoring matrix blosum80 the same results are obtained as
when using blosum60.
[Figure 3.9 image: phylogenetic tree titled "MSA among the entire sequences of HSF1 by using the scoring matrix blosum80"; leaves: human, taurus, mouse, danio rerio, yeast, arabidopsis; internal nodes: Branch 1 to Branch 4 and Root; horizontal axis from 0 to 1.8.]
Figure: 3.9. Similarity between the entire HSF1 protein sequences of all the species above, using the scoring matrix blosum80.
c) Matrix PAM10
When we use the scoring matrix PAM10 the same results are obtained as when
using blosum60 and blosum80.
[Figure 3.10 image: phylogenetic tree titled "MSA among the entire sequences of HSF1 by using the scoring matrix pam10"; leaves: arabidopsis, yeast, human, taurus, mouse, danio rerio; internal nodes: Branch 1 to Branch 4 and Root; horizontal axis from -0.2 to 1.8.]
Figure: 3.10. Similarity between the entire HSF1 protein sequences of all the species above, using the scoring matrix PAM10.
3.5.3. Using the Gene Tracer algorithm to refine some results obtained in Table 3.1 and Table 3.2
In this section we use the Gene Tracer algorithm [14], which is described in detail in chapter two. Gene Tracer takes two ancestor sequences and their offspring sequence, tracks down gene modifications in the ancestor sequences, and finds the parts of the offspring that are related to each ancestor. As shown in Table 3.1 and Table 3.2, human and Taurus are closer to each other than to the other organisms over the entire length of the HSF1 sequence, but the HSF1 conserved domain (HSF_DNA-bind) sequence is more similar between human and mouse than between any other pair. We therefore use Gene Tracer to identify the parts of the offspring sequence related to each ancestor sequence. Moreover, Gene Tracer locates precisely where each ancestor sequence contributes within the offspring and gives statistical results that express the relationship between the two ancestor sequences and their offspring. We consider Taurus as ancestor1, mouse as ancestor2, and human as the offspring. We coded the Gene Tracer algorithm in Perl and applied it to both the conserved domain sequences (HSF_DNA-bind) and the entire sequence of HSF1 protein. The results are given in the following table:
Conserved domain (HSF_DNA-bind)   Match percentage   The entire sequence of HSF1 protein   Match percentage
Human & Mouse                     100%               Human & Mouse                         55.26%
Human & Taurus                    96.19%             Human & Taurus                        100%
Table 3.3. Match percentages of the pairwise local alignment for the entire sequence of HSF1 protein and for the conserved domain (HSF_DNA-bind) in the HSF1 protein sequence.
The results in Table 3.3 show that, for the conserved domain sequence (HSF_DNA-bind), the highest similarity is between human and mouse (100%), whereas for the entire sequence of HSF1 the highest similarity is between human and Taurus. These results confirm those obtained in Table 3.1 and Table 3.2. The following figures show the Gene Tracer results graphically and in detail; from these figures we can also identify the parts of the offspring sequence related to each ancestor sequence.
Figure: 3.11. Results of Gene Tracer program in the case of the conserved domain
sequence (HSF_DNA-bind)
Figure: 3.12. Results of gene tracer program in the case of the entire sequence of HSF1
protein sequence
3.6. Conclusion
Heat shock factor (HSF) is a transcriptional activator of heat shock genes. Heat shock transcription factors (HSFs) bind to conserved regulatory elements located in the promoters of HSP genes, known as heat shock elements (HSEs) [19]. Upon activation, the HSFs bind to HSEs and interact with proteins of the basal transcription machinery [20], [21]. The presence of common HSF recognition elements explains how the same HSFs can be activated in different related organisms [22]. The pairwise alignment comparison of HSF1 among the eukaryotic organisms studied (human, Taurus, Danio rerio, mouse, plant (Arabidopsis), and yeast) shows that human and Taurus are the closest pair over the entire length of HSF1 using the scoring matrices BLOSUM30 and BLOSUM50, as shown in Table 3.2. However, the HSF1 conserved domain (HSF_DNA-bind) sequence was more similar between human and mouse than between any other pair using the same BLOSUM matrices, as shown in Table 3.1. Similar results are obtained using multiple sequence alignment. As shown in Table 3.3, the results obtained with the Gene Tracer algorithm confirm that the conserved domain (HSF_DNA-bind) in mouse is the same as in human and that the entire sequence of HSF1 protein in human matches that in Taurus; the related parts of the sequences can also be shown clearly. One important result is that, comparing Table 3.1 with Table 3.2, for all the eukaryotic organisms used the degree of similarity of the conserved domain (HSF_DNA-bind) is higher than the degree of similarity of the entire sequence of HSF1 protein. These results were accepted for publication under the title "Using Sequence Alignment Algorithms to Demonstrate The Genetic Evolution of Heat Shock Factor1 (Hsf1) In Different Eukaryotic Organisms" in an international journal called International Journal for Engineering Science and Technology (ESTIJ) [23].
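The comparison between Table 3.1 and Table 3.2 can be checked directly from the identity percentages. The short Python sketch below copies the Blosum50 identity columns of the two tables and tests, pair by pair, whether the conserved domain is more similar than the entire sequence; the variable names are chosen for the example only.

```python
# (conserved-domain identity %, entire-sequence identity %), Blosum50 columns
# of Table 3.1 and Table 3.2.
IDENTITIES = {
    "Human & Danio rerio": (89, 56),
    "Human & Yeast": (46, 22),
    "Human & Plant (Arabidopsis)": (51, 27),
    "Human & Mouse": (100, 70),
    "Human & Taurus": (98, 89),
    "Danio rerio & Mouse": (89, 52),
    "Danio rerio & Taurus": (87, 56),
    "Danio rerio & Plant (Arabidopsis)": (51, 26),
    "Danio rerio & Yeast": (46, 22),
    "Mouse & Taurus": (98, 70),
    "Mouse & Yeast": (46, 22),
    "Mouse & Plant (Arabidopsis)": (51, 27),
    "Taurus & Yeast": (47, 23),
    "Taurus & Plant (Arabidopsis)": (50, 27),
    "Yeast & Plant (Arabidopsis)": (43, 24),
}

# Difference in percentage points between domain and entire-sequence identity.
gaps = {pair: domain - full for pair, (domain, full) in IDENTITIES.items()}
all_higher = all(gap > 0 for gap in gaps.values())
smallest = min(gaps, key=gaps.get)

print("domain identity higher for every pair:", all_higher)
print("smallest difference:", smallest, gaps[smallest], "percentage points")
```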
CHAPTER FOUR
MATHEMATICAL MODELING AND CLASSIFICATION OF VIRUSES FROM HERPESVIRUS FAMILY
4.1. Overview
Modeling and classifying the viruses that belong to a specific family is important for biologists and for many biological applications. There are many ways to classify virus families. The degree of similarity or diversity among the structures of the viral capsid proteins is very useful for studying the classification of virus families and their genetic evolution. It is also important to propose a mathematical model of the virus life cycle in order to understand fully the activities of virus families over their life cycle. In this chapter we propose a mathematical model of a simple life cycle for some viruses of the herpesvirus family, together with a comprehensive study of their classification using sequence alignment algorithms, in order to demonstrate their genetic evolution according to the structure of their capsid protein. The herpesvirus family is considered one of the most important families of enveloped DNA viruses, as it contains many viruses that are dangerous to human health. This family contains one of the newly discovered viruses, Epstein–Barr virus (EBV), also called human herpesvirus 4 (HHV-4), which is one of the most dangerous viruses. Infection with EBV occurs by the oral transfer of saliva [24] and genital secretions.
4.2. Introduction to viruses
Viruses are small infectious agents, yet they are considered the most dangerous infectious agents for eukaryotic organisms. A virus can replicate only inside the living cells of other organisms. Once viruses enter a host cell, they introduce their genetic material, DNA or RNA, into it. Viruses infect a wide range of host cells, from prokaryotic (e.g. bacteria) to eukaryotic (e.g. plants, animals, humans, etc.). A virus has a simple structure consisting of only two components, as shown in Figure 4.1:
1. Genome: the genome of a virus is either ribonucleic acid (RNA) or deoxyribonucleic acid (DNA); it contains only one type.
2. Coat or capsid protein: the protein that encloses the genome.
Figure: 4.1. Virus structure components
DNA viruses have the following structure [25]:
Figure: 4.2. General DNA Virus structure.
Our study is focused on enveloped DNA viruses.
4.3. Enveloped DNA Viruses
Enveloped DNA viruses are found in three families [26], as shown in Figure 4.2:
1. Herpesviridae (Herpesvirus) family
2. Poxviridae (Poxvirus) family
3. Hepadnaviridae (Hepadnavirus) family
In this study we focus on the first family of enveloped DNA viruses, the herpesvirus family (colored yellow in the figure). We propose a mathematical model for its life cycle and present a comprehensive study that classifies the members of the herpesvirus family according to the structure of their capsid protein using different sequence alignment algorithms.
4.4. Herpesviridae (Herpesvirus) family
Herpesviridae is a large family of DNA viruses that cause diseases in animals, including humans. One of the distinctive features of the viruses in this family is that they have a double-stranded, linear molecular structure with icosahedral symmetry [27][28][29]. The members of this family are best known as herpesviruses. The word herpesvirus is thought to be derived from the Greek word herpein ("to creep"), which refers to the latent and recurring infections typical of this family of viruses. In this regard, Herpesviridae are known to cause latent or lytic infections. This family contains five species that are considered a danger to human health: HSV-1 and HSV-2 (both of which can cause orolabial herpes and genital herpes), Varicella zoster virus (which causes chickenpox and shingles), Epstein-Barr virus (which causes mononucleosis), and Cytomegalovirus, all of which are considered extremely widespread among humans. Many reports indicate that at least one of these viruses has infected 90% of people, and a latent form of the virus remains in most people [30][31][32]. There are 8 herpesvirus types: Herpes simplex viruses 1 and 2, varicella-zoster virus, EBV (Epstein-Barr virus), human cytomegalovirus, human herpesvirus 6, human herpesvirus 7, and Kaposi's sarcoma-associated herpesvirus [33]. There are more than 130 known herpesviruses [34], and some have been isolated from many organisms such as mammals, birds, fish, reptiles, amphibians, and mollusks [33].
The herpesvirus family contains the following virus types:
1. Herpes simplex 1 (HSV1)
2. Herpes simplex 2 (HSV2)
3. Epstein-Barr virus (EBV) or (HHV4)
4. Human herpesvirus 6 (HHV6)
5. Human herpesvirus 7 (HHV7)
6. Human herpesvirus 8 (HHV8)
7. Varicella-zoster virus (VZV) or (HHV3)
8. Human cytomegalovirus (HCMV) or (HHV5)
In this chapter we classify these viruses according to the structure of their capsid protein sequences using MATLAB algorithms, in order to understand the genetic evolution and the phylogenetic relations of this family.
4.5. Mathematical modeling of herpesvirus family life cycle
In this study, we describe a mathematical model for the general life cycle of herpesviruses, in which the virus uses the host cell's organelles to replicate new viruses; this is known as the lytic cycle [35]. During this cycle the virus has to infect the cell, so it attaches itself to the outer cell wall and releases enzymes that weaken the cell wall, as shown in Figure 4.3.
Figure: 4.3. Simple diagram representing the herpesvirus lytic cycle
The lytic cycle is considered the main cycle in viral replication. As shown in Figure 4.3, the viral DNA enters the cell, transcribes itself into messenger RNAs in the host cell, and uses them to direct the ribosomes to produce viral proteins, including the capsid protein. The virus takes over the cell's metabolic activities, and the host cell's DNA is destroyed. The virus produces progeny viruses, using the cell's energy for its own propagation. The viruses multiply, and the original virus releases enzymes that break the cell wall. The cell wall bursts in a process known as lysis, and the new viruses are released [36].
The lytic cycle of a virus consists of six stages [37]. The first two stages are adsorption and penetration. In the adsorption stage, the virus must attach to a receptor on the cell membrane in order to enter the cell through the plasma membrane. In the penetration stage, the virus releases its genetic material into the cell. This is followed by the integration stage, in which host cell gene expression is arrested and viral material is embedded into the host cell nucleus. The fourth stage is biosynthesis, in which the virus uses the cell machinery to make large amounts of viral components and, in the meantime, destroys the host's DNA. The last two stages are maturation and lysis, in which the mature virus particles are formed and released.
In this chapter, a basic mathematical model of the lytic cycle is proposed in four main stages, and the model is converted into a system of equations. The equations are solved numerically, and the solution describes how the lytic virus terminates its infection and breaches its host's cell envelope over time t.
The reaction scheme is described by the following four reactions:
d/dt    decay      forward    forward    reverse
[x1]    -k1[x1]    0          0          k4[x4]
[x2]    k1[x1]     -k2[x2]    0          0
[x3]    0          k2[x2]     -k3[x3]    0
[x4]    0          0          k3[x3]     -k4[x4]
Table 4.1. This table shows the symbols of the modeling steps.
The dynamics of the system can be written as the following set of first order ODEs:
d[x1]/dt = k4[x4] - k1[x1],
d[x2]/dt = k1[x1] - k2[x2],
d[x3]/dt = k2[x2] - k3[x3],
d[x4]/dt = k3[x3] - k4[x4].
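The same system can be restated in matrix form, writing x_i for [x_i]; this is only a rewriting of the four equations above. Each column of the coefficient matrix sums to zero, so under this model the total x_1 + x_2 + x_3 + x_4 does not change with time.

```latex
\frac{d}{dt}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix}
-k_1 & 0    & 0    & k_4  \\
k_1  & -k_2 & 0    & 0    \\
0    & k_2  & -k_3 & 0    \\
0    & 0    & k_3  & -k_4
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix},
\qquad
\frac{d}{dt}\left( x_1 + x_2 + x_3 + x_4 \right) = 0 .
```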
This system of ODEs can be solved numerically using MATLAB. The mathematical model is then solved for some initial values.
[Figure 4.4 image: curves x1, x2, x3 and x4; vertical axis "solution x" from 0 to 8 (scale x 10^4); horizontal axis "time t" from 0 to 70.]
Figure: 4.4. The solution to the system of equations produced by ODE45.
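The numerical solution of the four ODEs can be sketched without MATLAB. The Python example below is an illustration under stated assumptions, not the computation behind Figure 4.4: it uses a fixed-step classical Runge-Kutta (RK4) scheme instead of ODE45, and the rate constants `RATES`, the initial values `X0` and the step size are hypothetical, since the values used in this work are not listed here.

```python
def lytic_rhs(x, k):
    """Right-hand side of the four-stage model d[x]/dt given above."""
    x1, x2, x3, x4 = x
    k1, k2, k3, k4 = k
    return [k4 * x4 - k1 * x1,
            k1 * x1 - k2 * x2,
            k2 * x2 - k3 * x3,
            k3 * x3 - k4 * x4]


def rk4_step(f, x, h, k):
    """One classical fourth-order Runge-Kutta step of size h."""
    a = f(x, k)
    b = f([xi + 0.5 * h * ai for xi, ai in zip(x, a)], k)
    c = f([xi + 0.5 * h * bi for xi, bi in zip(x, b)], k)
    d = f([xi + h * ci for xi, ci in zip(x, c)], k)
    return [xi + h * (ai + 2 * bi + 2 * ci + di) / 6.0
            for xi, ai, bi, ci, di in zip(x, a, b, c, d)]


def solve(x0, k, t_end, h):
    """Integrate from t = 0 to t_end; returns the list of states."""
    steps = int(round(t_end / h))
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = rk4_step(lytic_rhs, x, h, k)
        trajectory.append(x)
    return trajectory


RATES = (0.5, 0.3, 0.2, 0.1)      # hypothetical k1..k4
X0 = (80000.0, 0.0, 0.0, 0.0)     # hypothetical initial values
trajectory = solve(X0, RATES, t_end=70.0, h=0.01)
print([round(v, 1) for v in trajectory[-1]])
print(round(sum(trajectory[-1]), 6))
```

Because the right-hand sides sum to zero, the printed total should stay equal to the initial total up to rounding, which is a convenient sanity check on any solver used for this model.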
4.6. Biological data used in our work
All the biological data used (the sequences of the capsid protein of the herpesvirus family) were obtained from the protein sequence repository of the National Center for Biotechnology Information (NCBI) and from the UniProt Knowledgebase (UniProtKB), which is the central hub for the collection of functional information on proteins. All these data are written in FASTA format, which is one of the most popular formats for writing sequences in bioinformatics applications and is common to most biological databases.
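A FASTA record consists of a header line that starts with ">" followed by one or more lines of sequence. The following Python sketch shows one way to read such records; the function name `parse_fasta` is an assumption for the example, and the sample text is the header and the first two sequence lines of the HSV-1 major capsid protein record used in this chapter, truncated for brevity.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated sample: only the first two sequence lines of the record.
SAMPLE = """>sp|P06491|MCP_HHV11 Major capsid protein OS=Human herpesvirus 1 (strain 17) GN=UL19 PE=1 SV=1
MAAPNRDPPGYRYAAAMVPTGSLLSTIEVASHRRLFDFFSRVRS
DANSLYDVEFDALLGSYCNTLSLVRFLELGLSVACVCTKFPELA
"""

records = parse_fasta(SAMPLE)
for header, sequence in records:
    print(header.split()[0], len(sequence))
```

The header must stay on a single line: any line that does not start with ">" is treated as sequence.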
1. Data represent the entire sequence of major capsid protein of Herpes
simplex virus type 1
>sp|P06491|MCP_HHV11 Major capsid protein OS=Human herpesvirus 1 (strain 17) GN=UL19 PE=1 SV=1
MAAPNRDPPGYRYAAAMVPTGSLLSTIEVASHRRLFDFFSRVRS
DANSLYDVEFDALLGSYCNTLSLVRFLELGLSVACVCTKFPELA
YMNEGRVQFEVHQPLIARDGPHPIEQPTHNYMTKIIDRRALNAA
FSLATEAIALLTGEALDGTGISAHRQLRAIQQLARNVQAVLGAFE
RGTADQMLHVLLEKAPPLALLLPMQRYLDNGRLATRVARATLV
AELKRSFCETSFFLGKAGHRREAVEAWLVDLTTATQPSVAVPRL
THADTRGRPVDGVLVTTAPIKQRLLQSFLKVEDTEADVPVTYGE
MVLNGANLVTALVMGKAVRSLDDVGRHLLEMQEEQLDLNRQT
LDELESAPQTTRVRADLVSIGEKLVFLEALEKRIYAATNVPYPLV
GAMDLTFVLPLGLFNPVMERFAAHAGDLVPAPGHPDPRAFPPRQ
LFFWGKDRQVLRLSLEHAIGTVCHPSLMNVDAAVGGLNRDPVE
AANPYGAYVAAPAGPAADMQQLFLNAWGQRLAHGRVRWVAE
GQMTPEQFMQPDNANLALELHPAFDFFVGVADVELPGGDVPPA
GPGEIQATWRVVNGNLPLALCPAAFRDARGLELGVGRHAMAPA
TIAAVRGAFDDRNYPAVFYLLQAAIHGSEHVFCALARLVVQCIT
SYWNNTRCAAFVNDYSLVSYVVTYLGGDLPEECMAVYRDLVA
HVEALAQLVDDFTLTGPELGGQAQAELNHLMRDPALLPPLVWD
CDALMRRAALDRHRDCRVSAGGHDPVYAAACNVATADFNRND
GQLLHNTQARAADAADDRPHRGADWTVHHKIYYYVMVPAFSR
GRCCTAGVRFDRVYATLQNMVVPEIAPGEECPSDPVTDPAHPLH
PANLVANTVNAMFHNGRVVVDGPAMLTLQVLAHNMAERTTAL
LCSAAPDAGANTASTTNMRIFDGALHAGILLMAPQHLDHTIQNG
DYFYPLPVNALFAGADHVANAPNFPPALRDLSRQVPLVPPALGA
NYFSSIRQPVVQHVRESAAGENALTYALMAGYFKISPVALHHQL
KTGLHPGFGFTVVRQDRFVTENMLFSERASEAYFLGQLQVARHE
TGGGVNFTLTQPRGNVDLGVGYTAVVATATVRNPVTDMGNLP
QNFYLGRGAPPLLDNAAAVYLRNAVVAGNRLGPAQPVPVFGCA
QVPRRAGMDHGQDAVCEFIATPVSTDVNYFRRPCNPRGRAAGG
VYAGDKEGDVTALMYDHGQSDPSRAFAATANPWASQRFSYGD
LLYNGAYHLNGASPVLSPCFKFFTSADIAAKHRCLERLIVETGSA
VSTATAASDVQFKRPPGCRELVEDPCGLFQEAYPLTCASDPALL
RSARNGEAHARETHFAQYLVYDASPLKGLAL
2. Data represent the entire sequence of the Major capsid protein of
Human herpes simplex virus type2
>gi|360039880|gb|AEV91357.1| major capsid protein [Human herpesvirus 2]
MAAPARDPPGYRYAAAILPTGSILSTIEVASHRRLFDFFAAVRSD
ENSLYDVEFDALLGSYCNTLSLVRFLELGLSVACVCTKFPELAY
MNEGRVQFEVHQPLIARDGPHPVEQPVHNYMTKVIDRRALNAA
FSLATEAIALLTGEALDGTGISLHRQLRAIQQLARNVQAVLGAFE
RGTADQMLHVLLEKAPPLALLLPMQRYLDNGRLATRVARATLV
AELKRSFCDTSFFLGKAGHRREAIEAWLVDLTTATQPSVAVPRL
THADTRGRPVDGVLVTTAAIKQRLLQSFLKVEDTEADVPVTYGE
MVLNGANLVTALVMGKAVRSLDDVGRHLLDMQEEQLEANRET
LDELESAPQTTRVRADLVAIGDRLVFLEALERRIYAATNVPYPLV
GAMDLTFVLPLGLFNPAMERFAAHAGDLVPAPGHPEPRAFPPRQ
LFFWGKDHQVLRLSMENAVGTVCHPSLMNIDAAVGGVNHDPV
EAANPYGAYVAAPAGPGADMQQRFLNAWRQRLAHGRVRWVA
ECQMTAEQFMQPDNANLALELHPAFDFFAGVADVELPGGEVPP
AGPGAIQATWRVVNGNLPLALCPVAFRDARGLELGVGRHAMAP
ATIAAVRGAFEDRSYPAVFYLLQAAIHGNEHVFCALARLVTQCIT
SYWNNTRCAAFVNDYSLVSYIVTYLGGDLPEECMAVYRDLVAH
VEALAQLVDDFTLPGPELGGQAQAELNHLMRDPALLPPLVWDC
DGLMRHAALDRHRDCRIDAGGHEPVYAAACNVATADFNRNDG
RLLHNTQARAADAADDRPHRPADWTVHHKIYYYVLVPAFSRGR
CCTAGVRFDRVYATLQNMVVPEIAPGEECPSDPVTDPAHPLHPA
NLVANTVKRMFHNGRVVVDGPAMLTLQVLAHNMAERTTALLC
SAAPDAGANTASTANMRIFDGALHAGVLLMAPQHLDHTIQNGE
YFYVLPVHALFAGADHVANAPNFPPALRDLARDVPLVPPALGA
NYFSSIRQPVVQHARESAAGENALTYALMAGYFKMSPVALYHQ
LKTGLHPGFGFTVVRQDRFVTENVLFSERASEAYFLGQLQVARH
ETGGGVNFTLTQPRGNVDLGVGYTAVAATGTVRNPVTDMGNLP
QNFYLGRGAPPLLDNAAAVYLRNAVVAGNRLGPAQPLPVFGCA
QVPRRAGMDHGQDAVCEFIATPVATDINYFRRPCNPRGRAAGG
VYAGDKEGDVIALMYDHGQSDPARPFAATANPWASQRFSYGDL
LYNGAYHLNGASPVLSPCFKFFTAADITAKHRCLERLIVETGSAV
STATAASDVQFKRPPGCRELVEDPCGLFQEAYPITCASDPALLRS
ARDGEAHARETHFTQYLIYDASPLKGLSL
3. Data represent the entire sequence of the Major capsid protein of
Epstein-Barr virus (EBV)
>tr|V5KU49|V5KU49_EBVG Major capsid protein OS=Epstein-Barr virus (strain GD1) GN=BcLF1 PE=4 SV=1
MASNEGVENRPFPYLTVDADLLSNLRQSAAEGLFHSFDLLVGKD
AREAGIKFEVLLGVYTNAIQYVRFLETALAVSCVNTEFKDLSRM
TDGKIQFRISVPTIAHGDGRRPSKQRTFIVVKNCHKHHISTEMELS
MLDLEILHSIPETPVEYAEYVGAVKTVASALQFGVDALERGLINT
VLSVKLRHAPPMFILQTLADPTFTERGFSKTVKSDLIAMFKRHLL
EHSFFLDRAENMGSGFSQYVRSRLSEMVAAVSGESVLKGVSTYT
TAKGGEPVGGVFIVTDNVLRQLLTFLGEEADNQIMGPSSYASFV
VRGENLVTAVSYGRVMRTFEHFMARIVDSPEKAGSTKSDLPAV
AAGVEDQPRVPISAAVIKLGNHAVAVESLQKMYNDTQSPYPLNR
RMQYSYYFPVGLFMPNPKYTTSAAIKMLDNPTQQLPVEAWIVN
KNNLLLAFNLQNALKVLCHPRLHTPAHTLNSLNAAPAPRDRRET
YSLQHRRPNHMNVLVIVDEFYDNKYAAPVTDIALKCGLPTEDFL
HPSNYDLLRLELHPLYDIYIGRDAGERARHRAVHRLMVGNLPTP
LAPAAFQEARGQQFETATSLAHVVDQAVIETVQDTAYDTAYPAF
FYVVEAMIHGFEEKFVMNVPLVSLCINTYWERSGRLAFVNSFSM
IKFICRHLGNNAISKEAYSMYRKIYGELIALEQALMRLAGSDVVG
DESVGQYVCALLDPNLLPPVAYTDIFTHLLTVSDRAPQIIIGNEVY
ADTLAAPQFIERVGNMDEMAAQFVALYGYRVNGDHDHDFRLH
LGPYVDEGHADVLEKIFYYVFLPTCTNAHMCGLGVDFQHVAQT
LAYNGPAFSHHFTRDEDILDNLENGTLRDLLEISDLRPTVGMIRD
LSASFMTCPTFTRAVRVSVDNDVTQQLAPNPADKRTEQTVLVN
GLVAFAFSERTRAVTQCLFHAIPFHMFYGDPRVAATMHQDVATF
VMRNPQQRAVEAFNRPEQLFAEYREWHRSPMGKYAAECLPSLV
SISGMTAMHIKMSPMAYIAQAKLKIHPGVAMTVVRTDEILSENIL
FSSRASTSMFIGTPNVSRREARVDAVTFEVHHEMASIDTGLSYSS
TMTPARVAAITTDMGIHTQDFFSVFPAEAFGNQQVNDYIKAKVG
AQRNGTLLRDPRTYLAGMTNVNGAPGLCHGQQATCEIIVTPVTA
DVAYFQKSNSPRGRAACVVSCENYNQEVAEGLIYDHSRPDAAY
EYRSTVNPWASQLGSLGDIMYNSSYRQTAVPGLYSPCRAFFNKE
ELLRNNRGLYNMVNEYSQRLGGHPATSNTEVQFVVIAGTDVFLE
QPCSFLQEAFPALSASSRALIDEFMSVKQTHAPIHYGHYIIEEVAP
VRRILKFGNKVVF
4. Data represent the entire sequence of the Major capsid protein of
Human herpesvirus 6 (HHV6)
>sp|P17887|MCP_HHV6U Major capsid protein OS=Human herpesvirus 6A (strain Uganda-1102) GN=U57 PE=3 SV=3
MENWQATEILPKIEAPLNIFNDIKTYTAEQLFDNLRIYFGDDPSRY
NISFEALLGIYCNKIEWINFFTTPIAVAANVIRFNDVSRMTLGKVL
FFIQLPRVATGNDVTASKETTIMVAKHSEKHPINISFDLSAACLEH
LENTFKNTVIDQILNINALHTVLRSLKNSADSLERGLIHAFMQTLL
RKSPPQFIVLTMNENKVHNKQALSRVQRSNMFQSLKNRLLTSLF
FLNRNNNISYIYRILNDMMESVTESILNDTNNYTSKENVPLDGVL
LGPIGSIQKLTSILSQYISTQVVSAPISYGHFIMGKENAVTAIAYRA
IMADFTQFTVNAGTEQQDTNNKSEIFDKSRAYADLKLNTLKLGD
KLVAFDHLHKVYKNTDVNDPLEQSLQLTFFFPLGIYIPSETGFST
METRVKLNDTMENNLPTSVFFHNKDQVVQRIDFADILPSVCHPI
VHDSTIVERLMKSEPLPTGHRFSQLCQLKITRENPARILQTLYNLY
ESRQEVPKNTNVLKNELNIEDFYKPDNPTLPTERHPFFDLTYIQK
NRATEVLCTPRIMIGNIPLPLAPVSFHEARTNQILEHAKTNCQKY
DFTLKIVTESLTSGSYPELAYVIETLVHGNKHAFMILKQVISQCIS
YWFNMKHILLFCNSFEMIMLISNHMGDELIPGAAFAHYRNLVSLI
RLVKRTISISNLNEQLCGEPLVNFANALFDGRLFCPFVHTMPRND
TNAKITADDTPLTQNTVRVRNYEISDVQRMNLIDSSVVFTDNDR
PSNETTILSKIFYFCVLPALSNNKACGAGVNVKELVLDLFYTEPFI
SPDDYFQENPITSDVLMSLIREGMGPGYTVANTSCIAKQLFKSLIY
INENTKILEVEVSLDPAQRHGNSVHFQSLQHILYNGLCLISPITTLR
RYYQPIPFHRFFSDPGICGTMNADIQVFLNTFPHCQRNDGGFPLPP
PLALEFYNWQRTPFSVYSAFCPNSLLSIMTLAAMHSKLSPVAIAI
QSKNKIHPGFAATLVRTDNFDVECLLYSSRAATSIILDDPTVTAE
AKDIATTYNFTQHLSFVDMGLGFSSTTATANLKRIKSDMGSKIQ
NLFSAFPIHAFTNADINTWIRHHVGIEKPNPSESEALNIITFGGINK
NPPSILLHGQQAICEVILTPVTTNINFFKSPHNPRGRESCMMGTDP
HNEEAARKALYDHTQTDSDTFAATTNPWASLPGSLGDILYNTAH
REQLCYNPKTYSPNAQFFTESDILKTNKMMYKVISEYCMKSNSC
LNSDSEIQYSCSEGTDSFVSRPCQFLQNALPLHCSSNQALLESRSK
TGNTQISETHYCNYAIGETIPFQLIIESSI
5. Data represent the entire sequence of the Major capsid protein of
Human herpesvirus 7 (HHV7)
>sp|P52347|MCP_HHV7J Major capsid protein OS=Human herpesvirus 7 (strain JI) GN=U57 PE=3 SV=1
MENWRTAEIFPKLDVSPNVFDDIRTQTAEQLFENLRLYYGDDSD
RYNISFEALLGIYCNRTEWIDFFHTSIAVAANVIRFNDLDKMSLG
KILFYIQLPRVATGNDVTAPKETTVLVTKYSEKHPINISFELSAAC
LAHLENTFKNTILDQMLNINAIHTVLRSLKNSADSLQRGLIYAFIK
TILKKAPPQFILKTMLENKVNSKQILSKVQRSNMFQNFKNKLINS
LFFLNRTSNVSFIYRYLCEMVDSTTESILNNTNSYVLKDGTPINGV
LLGTPNTIQILSNALSQHISQMTMSVPVSYGTFVMGKENAVTAIA
YQAIMADFSNYTKNVATETQDQNKKSEIFENQTQHADLKTNIIQ
LSDKTVVLDHLKKVYKNTNIEDPLEQKLELTFFFPMGLYISKDSG
FSTMDSRLKLNDTMENNLPTSIYFYNKDKLLQRIDYSDLLPSLCH
PIIFDCSVSERIFKNAAKPTGESFNQLCQVEFVREPPSTFLSNLYNL
YEMKKEIPKTTNMLKNELTTEDFYKSENFTLKTELHPFFDFTYIQ
KNRSTDVLCSPRILLGNIPLPLAPSSFHEARTNQMIEQAKTNNLN
YDYTLKLVVESLTNTAYPELAYIIELLIHGNKTAFQILKDVVSQCI
TYWYNIKHILLFCNNFEMIWLITTYLGDESIPGIAYTHYKNIISILK
LVKRTISISNFNEQLCGEPLVGFVNALFDNRLFPPFLNSLPKNEAN
AIITAGNTPLTQNTVKLRNYEVSDLNRMNLLDSTEIFTDVDRPSF
ETIVLSKIFYFCFLPALTNNKMCGAGFDVKSFILDFFYTEPFILPDD
NFCELPITNNVLIELITEAVGPSHALTDLSCIGKQLFKSILYLTENT
KILEIESSLDPSQRHGSSSNFKSLQHVLYNGLCLVSPINVLKRYFK
PIPFNRFFSDPIICGLMNIEVQTYLNIFPHYQRNDGGFPLPQALSHE
FHNWQRTPFFVYASCCSNSLLSIMTLATMHCKLSPIAIILQSRQKI
HPGFAATLVRTDCFDINCLLYSSKSATSIMIDDPTVSTEVKDISTT
YNLTQHISFLDMGLGFSSSTAIANLKRVKTDMGSKVQDLFSVFP
MHAYTNPTVNSWVRHHVGIEKPNPSETDALNILSFGKINKQSQSI
LLHGQQAICEVVITPVTSDINFYKTPKNPRGRASCMMGVDPHNE
SEARKSLYDHSRVDSDAFVATTNPWASQEGSLSDVLYNINHRDQ
LGYNPKSYSPNAVFFTDTEIFKTNKFMFKLISDYSIKTKTCLDSDT
DIQYSCSEGTDDVTHRPCQFLQIAFPIHCSSNQALLESRSKNGMT
QLSETHFANFAIGECIPLQNIIESLL
6. Data represent the entire sequence of the Major capsid protein of
Human herpesvirus 8 (HHV8)
>sp|Q2HRA7|MCP_HHV8P Major capsid protein OS=Human herpesvirus 8 type P (isolate GK18) GN=ORF25 PE=3 SV=1
MEATLEQRPFPYLATEANLLTQIKESAADGLFKSFQLLLGKDARE
GSVRFEALLGVYTNVVEFVKFLETALAAACVNTEFKDLRRMIDG
KIQFKISMPTIAHGDGRRPNKQRQYIVMKACNKHHIGAEIELAAA
DIELLFAEKETPLDFTEYAGAIKTITSALQFGMDALERGLVDTVL
AVKLRHAPPVFILKTLGDPVYSERGLKKAVKSDMVSMFKAHLIE
HSFFLDKAELMTRGKQYVLTMLSDMLAAVCEDTVFKGVSTYTT
ASGQQVAGVLETTDSVMRRLMNLLGQVESAMSGPAAYASYVV
RGANLVTAVSYGRAMRNFEQFMARIVDHPNALPSVEGDKAALA
DGHDEIQRTRIAASLVKIGDKFVAIESLQRMYNETQFPCPLNRRIQ
YTYFFPVGLHLPVPRYSTSVSVRGVESPAIQSTETWVVNKNNVPL
CFGYQNALKSICHPRMHNPTQSAQALNQAFPDPDGGHGYGLRY
EQTPNMNLFRTFHQYYMGKNVAFVPDVAQKALVTTEDLLHPTS
HRLLRLEVHPFFDFFVHPCPGARGSYRATHRTMVGNIPQPLAPRE
FQESRGAQFDAVTNMTHVIDQLTIDVIQETAFDPAYPLFCYVIEA
MIHGQEEKFVMNMPLIALVIQTYWVNSGKLAFVNSYHMVRFICT
HIGNGSIPKEAHGHYRKILGELIALEQALLKLAGHETVGRTPITHL
VSALLDPHLLPPFAYHDVFTDLMQKSSRQPIIKIGDQNYDNPQNR
ATFINLRGRMEDLVNNLVNIYQTRVNEDHDERHVLDVAPLDEN
DYNPVLEKLFYYVLMPVCSNGHMCGMGVDYQNVALTLTYNGP
VFADVVNAQDDILLHLENGTLKDILQAGDIRPTVDMIRVLCTSFL
TCPFVTQAARVITKRDPAQSFATHEYGKDVAQTVLVNGFGAFA
VADRSREAAETMFYPVPFNKLYADPLVAATLHPLLPNYVTRLPN
QRNAVVFNVPSNLMAEYEEWHKSPVAAYAASCQATPGAISAMV
SMHQKLSAPSFICQAKHRMHPGFAMTVVRTDEVLAEHILYCSRA
STSMFVGLPSVVRREVRSDAVTFEITHEIASLHTALGYSSVIAPAH
VAAITTDMGVHCQDLFMIFPGDAYQDRQLHDYIKMKAGVQTGS
PGNRMDHVGYTAGVPRCENLPGLSHGQLATCEIIPTPVTSDVAY
FQTPSNPRGRAASVVSCDAYSNESAERLFYDHSIPDPAYECRSTN
NPWASQRGSLGDVLYNITFRQTALPGMYSPCRQFFHKEDIMRYN
RGLYTLVNEYSARLAGAPATSTTDLQYVVVNGTDVFLDQPCHM
LQEAYPTLAASHRVMLAEYMSNKQTHAPVHMGQYLIEEVAPM
KRLLKLGNKVVY
7. Data represent the entire sequence of the Major capsid protein of
Human cytomegalovirus
>sp|P16729|MCP_HCMVA Major capsid protein OS=Human cytomegalovirus
(strain AD169) GN=UL86 PE=3 SV=1
MENWSALELLPKVGIPTDFLTHVKTSAGEEMFEALRIYYGDDPE
RYNIHFEAIFGTFCNRLEWVYFLTSGLAAAAHAIKFHDLNKLTTG
KMLFHVQVPRVASGAGLPTSRQTTIMVTKYSEKSPITIPFELSAA
CLTYLRETFEGTILDKILNVEAMHTVLRALKNTADAMERGLIHSF
LQTLLRKAPPYFVVQTLVENATLARQALNRIQRSNILQSFKAKM
LATLFLLNRTRDRDYVLKFLTRLAEAATDSILDNPTTYTTSSGAK
ISGVMVSTANVMQIIMSLLSSHITKETVSAPATYGNFVLSPENAV
TAISYHSILADFNSYKAHLTSGQPHLPNDSLSQAGAHSLTPLSMD
VIRLGEKTVIMENLRRVYKNTDTKDPLERNVDLTFFFPVGLYLPE
DRGYTTVESKVKLNDTVRNALPTTAYLLNRDRAVQKIDFVDAL
KTLCHPVLHEPAPCLQTFTERGPPSEPAMQRLLECRFQQEPMGG
AARRIPHFYRVRREVPRTVNEMKQDFVVTDFYKVGNITLYTELH
PFFDFTHCQENSETVALCTPRIVIGNLPDGLAPGPFHELRTWEIME
HMRLRPPPDYEETLRLFKTTVTSPNYPELCYLVDVLVHGNVDAF
LLIRTFVARCIVNMFHTRQLLVFAHSYALVTLIAEHLADGALPPQ
LLFHYRNLVAVLRLVTRISALPGLNNGQLAEEPLSAYVNALHDH
RLWPPFVTHLPRNMEGVQVVADRQPLNPANIEARHHGVSDVPR
LGAMDADEPLFVDDYRATDDEWTLQKVFYLCLMPAMTNNRAC
GLGLNLKTLLVDLFYRPAFLLMPAATAVSTSGTTSKESTSGVTPE
DSIAAQRQAVGEMLTELVEDVATDAHTPLLQACRELFLAVQFV
GEHVKVLEVRAPLDHAQRQGLPDFISRQHVLYNGCCVVTAPKT
LIEYSLPVPFHRFYSNPTICAALSDDIKRYVTEFPHYHRHDGGFPL
PTAFAHEYHNWLRSPFSRYSATCPNVLHSVMTLAAMLYKISPVS
LVLQTKAHIHPGFALTAVRTDTFEVDMLLYSGKSCTSVIINNPIVT
KEERDISTTYHVTQNINTVDMGLGYTSNTCVAYVNRVRTDMGV
RVQDLFRVFPMNVYRHDEVDRWIRHAAGVERPQLLDTETISML
TFGSMSERNAAATVHGQKAACELILTPVTMDVNYFKIPNNPRGR
ASCMLAVDPYDTEAATKAIYDHREADAQTFAATHNPWASQAG
CLSDVLYNTRHRERLGYNSKFYSPCAQYFNTEEIIAANKTLFKTI
DEYLLRAKDCIRGDTDTQYVCVEGTEQLIENPCRLTQEALPILST
TTLALMETKLKGGAGAFATSETHFGNYVVGEIIPLQQSMLFNS
8. Data represent the entire sequence of the Major capsid protein of
Varicella-zoster virus
>sp|P09245|MCP_VZVD Major capsid protein OS=Varicella-zoster virus (strain
Dumas) GN=40 PE=3 SV=1
MTTVSCPANVITTTESDRIAGLFNIPAGIIPTGNVLSTIEVCAHRCI
FDFFKQIRSDDNSLYSAQFDILLGTYCNTLNFVRFLELGLSVACIC
TKFPELAYVRDGVIQFEVQQPMIARDGPHPVDQPVHNYMVKRIH
KRSLSAAFAIASEALSLLSNTYVDGTEIDSSLRIRAIQQMARNLRT
VLDSFERGTADQLLGVLLEKAPPLSLLSPINKFQPEGHLNRVARA
ALLSDLKRRVCADMFFMTRHAREPRLISAYLSDMVSCTQPSVM
VSRITHTNTRGRQVDGVLVTTATLKRQLLQGILQIDDTAADVPV
TYGEMVLQGTNLVTALVMGKAVRGMDDVARHLLDITDPNTLNI
PSIPPQSNSDSTTAGLPVNARVPADLVIVGDKLVFLEALERRVYQ
ATRVAYPLIGNIDITFIMPMGVFQANSMDRYTRHAGDFSTVSEQ
DPRQFPPQGIFFYNKDGILTQLTLRDAMGTICHSSLLDVEATLVA
LRQQHLDRQCYFGVYVAEGTEDTLDVQMGRFMETWADMMPH
HPHWVNEHLTILQFIAPSNPRLRFELNPAFDFFVAPGDVDLPGPQ
RPPEAMPTVNATLRIINGNIPVPLCPISFRDCRGTQLGLGRHTMTP
ATIKAVKDTFEDRAYPTIFYMLEAVIHGNERNFCALLRLLTQCIR
GYWEQSHRVAFVNNFHMLMYITTYLGNGELPEVCINIYRDLLQH
VRALRQTITDFTIQGEGHNGETSEALNNILTDDTFIAPILWDCDAL
IYRDEAARDRLPAIRVSGRNGYQALHFVDMAGHNFQRRDNVLI
HGRPVRGDTGQGIPITPHHDREWGILSKIYYYIVIPAFSRGSCCTM
GVRYDRLYPALQAVIVPEIPADEEAPTTPEDPRHPLHAHQLVPNS
LNVYFHNAHLTVDGDALLTLQELMGDMAERTTAILVSSAPDAG
AATATTRNMRIYDGALYHGLIMMAYQAYDETIATGTFFYPVPV
NPLFACPEHLASLRGMTNARRVLAKMVPPIPPFLGANHHATIRQP
VAYHVTHSKSDFNTLTYSLLGGYFKFTPISLTHQLRTGFHPGIAFT
VVRQDRFATEQLLYAERASESYFVGQIQVHHHDAIGGVNFTLTQ
PRAHVDLGVGYTAVCATAALRCPLTDMGNTAQNLFFSRGGVPM
LHDNVTESLRRITASGGRLNPTEPLPIFGGLRPATSAGIARGQASV
CEFVAMPVSTDLQYFRTACNPRGRASGMLYMGDRDADIEAIMF
DHTQSDVAYTDRATLNPWASQKHSYGDRLYNGTYNLTGASPIY
SPCFKFFTPAEVNTNCNTLDRLLMEAKAVASQSSTDTEYQFKRPP
GSTEMTQDPCGLFQEAYPPLCSSDAAMLRTAHAGETGADEVHL
AQYLIRDASPLRGCLPLPR
4.7. Results and discussion
4.7.1. Pairwise global alignment among the entire sequences of the
herpesvirus family
First, we wrote MATLAB code to perform pairwise global alignment among the
entire capsid protein sequences of the herpesvirus family.
The results are as follows.
1) The results obtained by using the sequence dot plot matrix
[Figure 4.5 consists of 28 dot plot panels, one for each pair of viruses. In
every panel the two axes are labelled with the virus names and give the residue
position in each sequence (ticks from 0 to 1200). Panel titles, in order:
dot plot matrix between HSV1 & HSV2; HSV1 & EBV; HSV1 & HHV6; HSV1 & HHV7;
HSV1 & HHV8; HSV1 & VZV; HSV1 & HCMV; EBV & HHV6; HSV2 & EBV; EBV & HHV8;
EBV & HHV7; VZV & EBV; EBV & HCMV; HSV2 & HHV6; HSV2 & HHV7; HSV2 & HHV8;
HSV2 & HCMV; HHV6 & HHV7; HHV6 & HHV8; HHV6 & VZV; HHV6 & HCMV; HHV7 & HHV8;
HHV7 & VZV; HHV7 & HCMV; HHV8 & VZV; HHV8 & HCMV; HSV2 & VZV; VZV & HCMV.]
Figure 4.5. Results of the sequence dot plot matrix among the entire sequences
of the capsid proteins of the herpesvirus family.
The results obtained in Figure 4.5 indicate that:
1. The capsid protein of herpes simplex virus 1 is most similar to the capsid
protein of herpes simplex virus 2 and also to that of varicella-zoster virus.
2. The capsid protein of herpes simplex virus 2 is also similar to the capsid
protein of varicella-zoster virus.
3. The capsid protein of Epstein-Barr virus is most similar to the capsid
protein of human herpesvirus 8.
4. The capsid protein of human herpesvirus 6 is most similar to the capsid
protein of human herpesvirus 7 and also to that of human cytomegalovirus.
5. The capsid protein of human cytomegalovirus is, in turn, similar to the
capsid protein of human herpesvirus 7.
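The idea behind a dot plot can be sketched in a few lines. The following is a minimal Python illustration, not the MATLAB code used in this work: the function names `dot_plot` and `render` and the parameters `window` and `min_matches` are hypothetical. A cell is marked when the two windows starting at that pair of positions share at least `min_matches` identical residues, so similar sequences produce a run of marks along the main diagonal. The two inputs are the first ten residues of the HHV7 and HCMV capsid sequences listed above.

```python
def dot_plot(seq_a, seq_b, window=1, min_matches=1):
    """Return a matrix whose cell [i][j] is True when the windows starting at
    seq_a[i] and seq_b[j] contain at least min_matches identical residues."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            matrix[i][j] = matches >= min_matches
    return matrix


def render(matrix):
    """Draw the matrix as text: '*' for a marked cell, '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)


# First ten residues of the HHV7 and HCMV major capsid proteins.
a = "MENWRTAEIF"
b = "MENWSALELL"

plot = dot_plot(a, b)                                # every identical residue pair
strict = dot_plot(a, b, window=3, min_matches=3)     # only exact 3-residue words
print(render(plot))
print()
print(render(strict))
```

Raising `window` and `min_matches` removes isolated chance matches and leaves mainly the diagonal runs, which is why filtered dot plots are easier to read for sequences of more than a thousand residues.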
2) The results obtained by using the Needleman-Wunsch algorithm
In our work we use the following scoring matrices:
1. BLOSUM50
2. BLOSUM30
Herpesvirus pair   Score (BLOSUM50)   Identities   Score (BLOSUM30)   Identities
HSV1 & HSV2        2940.33            94%          1848.8             94%
HSV1 & EBV         537.333            29%          423.2              28%
HSV1 & HHV6        503                26%          392                26%
HSV1 & HHV7        480                25%          373                25%
HSV1 & HHV8        573.667            31%          438.8              30%
HSV1 & VZV         1641.67            52%          1061.4             52%
HSV1 & HCMV        468.667            28%          375.4              27%
EBV & HSV2         541.667            28%          431.4              28%
EBV & HHV6         747.667            31%          533.8              31%
EBV & HHV7         726.667            32%          517.4              31%
EBV & HHV8         1780               56%          1128.6             56%
EBV & VZV          551.667            28%          412.8              27%
EBV & HCMV         722.667            30%          507.8              30%
HSV2 & HHV6        493.667            26%          392.2              26%
HSV2 & HHV7        472                24%          372.8              25%
HSV2 & HHV8        568                30%          436.6              30%
HSV2 & VZV         1635               52%          1060               52%
HSV2 & HCMV        470.333            27%          377                27%
HHV6 & HHV7        2182.67            68%          1371.4             68%
HHV6 & HHV8        707                30%          500.2              30%
HHV6 & VZV         466.667            26%          371.4              26%
HHV6 & HCMV        1400.33            44%          899.8              44%
HHV7 & HHV8        710.667            31%          498                31%
HHV7 & VZV         458                27%          359                26%
HHV7 & HCMV        1395.33            43%          893                44%
HHV8 & VZV         555                29%          412.4              29%
HHV8 & HCMV        711.333            30%          506                30%
VZV & HCMV         432.667            26%          347.2              26%
(VZV = Varicella-Zoster Virus; HCMV = Human Cytomegalo Virus)
Table 4.2. Score and percentage of identities of the pairwise global alignment
among the entire sequences of the capsid protein of the herpesvirus family.
The results in Table 4.2 confirm those indicated by Figure 4.5.
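The dynamic programming recurrence behind a Needleman-Wunsch global alignment can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in this work. It assumes a simple match/mismatch score and a linear gap penalty in place of BLOSUM50 or BLOSUM30, so its scores are not comparable with those in Table 4.2. The function names and the parameters `match`, `mismatch` and `gap` are hypothetical, and percent identity is assumed here to mean identical columns divided by alignment length.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(seq_a), len(seq_b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: seq_a against gaps only
        score[i][0] = i * gap
    for j in range(1, m + 1):          # first row: seq_b against gaps only
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in seq_b
                              score[i][j - 1] + gap)       # gap in seq_a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(seq_a[i - 1])
                out_b.append(seq_b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Identical columns as a percentage of the alignment length."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# First ten residues of the HHV7 and HCMV major capsid proteins.
s, x, y = needleman_wunsch("MENWRTAEIF", "MENWSALELL")
print(s)
print(x)
print(y)
print(percent_identity(x, y))
```

Replacing the match/mismatch test with a lookup in a substitution matrix turns this sketch into the scored variant used for proteins; the table filling and traceback stay the same.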
4.7.2. Multiple sequence alignment (MSA) among the entire sequences
of the capsid protein of the herpesvirus family
Second, we wrote MATLAB code to create a multiple sequence alignment (MSA)
of all the entire capsid protein sequences. The MSA is used to confirm and
refine the results obtained in Table 4.2.
The results are illustrated by phylogenetic trees.
In our work we use the following scoring matrices:
1. BLOSUM60
2. BLOSUM30
3. PAM10
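How pairwise similarities can be turned into a tree may be illustrated with a small example. The sketch below is a Python illustration under stated assumptions, not the MATLAB code used in this work: it takes the BLOSUM50 identities of Table 4.2, assumes the distance between two sequences is 1 minus the fraction of identities, and joins clusters by average linkage (UPGMA). The trees in this section may have been built with different distances and a different linkage, and the names `IDENTITY`, `upgma` and `clades` are hypothetical.

```python
from itertools import combinations

# Percent identities from the BLOSUM50 column of Table 4.2.
IDENTITY = {
    ("HSV1", "HSV2"): 94, ("HSV1", "EBV"): 29, ("HSV1", "HHV6"): 26,
    ("HSV1", "HHV7"): 25, ("HSV1", "HHV8"): 31, ("HSV1", "VZV"): 52,
    ("HSV1", "HCMV"): 28, ("EBV", "HSV2"): 28, ("EBV", "HHV6"): 31,
    ("EBV", "HHV7"): 32, ("EBV", "HHV8"): 56, ("EBV", "VZV"): 28,
    ("EBV", "HCMV"): 30, ("HSV2", "HHV6"): 26, ("HSV2", "HHV7"): 24,
    ("HSV2", "HHV8"): 30, ("HSV2", "VZV"): 52, ("HSV2", "HCMV"): 27,
    ("HHV6", "HHV7"): 68, ("HHV6", "HHV8"): 30, ("HHV6", "VZV"): 26,
    ("HHV6", "HCMV"): 44, ("HHV7", "HHV8"): 31, ("HHV7", "VZV"): 27,
    ("HHV7", "HCMV"): 43, ("HHV8", "VZV"): 29, ("HHV8", "HCMV"): 30,
    ("VZV", "HCMV"): 26,
}
NAMES = ["HSV1", "HSV2", "VZV", "EBV", "HHV8", "HHV6", "HHV7", "HCMV"]


def distance(a, b):
    """Assumed distance: 1 minus the fraction of identical residues."""
    pct = IDENTITY.get((a, b), IDENTITY.get((b, a)))
    return 1.0 - pct / 100.0


def leaves(tree):
    """List the leaf names of a tree given as nested 2-tuples."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def cluster_distance(x, y):
    """Average distance over all leaf pairs of two clusters (UPGMA)."""
    pairs = [(a, b) for a in leaves(x) for b in leaves(y)]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def upgma(names):
    """Repeatedly join the two closest clusters until one tree remains."""
    clusters = list(names)
    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters = [c for c in clusters if c != x and c != y]
        clusters.append((x, y))
    return clusters[0]


def clades(tree):
    """Return the leaf set of every internal node of the tree."""
    if isinstance(tree, str):
        return []
    return [frozenset(leaves(tree))] + clades(tree[0]) + clades(tree[1])


tree = upgma(NAMES)
found = clades(tree)
print(tree)
```

Because the input is only the rounded identities of Table 4.2, the branch lengths of such a tree are not expected to match the figures; the sketch is meant to show the joining procedure.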
a) Result obtained by using the scoring matrix BLOSUM60
[Phylogenetic tree titled "MSA of herpesvirus family by using scoring matrix
blosum60". Leaves, in plotted order: EBV, HHV8, HHV6, HHV7, HCMV, HSV2, HSV1,
VZV. Internal nodes: Branch 1 to Branch 6 and Root. Horizontal axis from 0 to
1.6.]
Figure 4.6. Results of MSA among the entire sequences of the herpesvirus family
by using the scoring matrix BLOSUM60
b) Result obtained by using the scoring matrix BLOSUM30
[Phylogenetic tree titled "MSA of herpesvirus family by using scoring matrix
blosum30". Leaves, in plotted order: EBV, HHV8, HHV6, HHV7, HCMV, HSV2, HSV1,
VZV. Internal nodes: Branch 1 to Branch 6 and Root. Horizontal axis from 0 to
1.6.]
Figure 4.7. Results of MSA among the entire sequences of the herpesvirus family
by using the scoring matrix BLOSUM30
c) Result obtained by using the scoring matrix PAM10
[Phylogenetic tree titled "MSA of herpesvirus family by using scoring matrix
PAM10". Leaves, in plotted order: HHV8, EBV, HHV6, HHV7, HCMV, HSV2, HSV1,
VZV. Internal nodes: Branch 1 to Branch 6 and Root. Horizontal axis from 0 to
1.6.]
Figure 4.8. Results of MSA among the entire sequences of the herpesvirus family
by using the scoring matrix PAM10
4.8. Conclusion
From the results obtained in Figures 4.5 to 4.8 and Table 4.2, we can
classify the herpesvirus family according to the similarity of the structure of
their capsid protein into three categories as follows:
1. The first category contains
a) Herpes simplex virus 1 (HSV1)
b) Herpes simplex virus 2 (HSV2)
c) Varicella-zoster virus, VZV (HHV3)
2. The second category contains
a) Human herpesvirus 8 (HHV8)
b) Epstein-Barr virus (EBV)
3. The third category contains
a) Human herpesvirus 7 (HHV7)
b) Human herpesvirus 6 (HHV6)
c) Human cytomegalovirus, HCMV (HHV5)
Figure 4.9. Classification of the herpesvirus family according to the
structure of their capsid protein
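The three categories can be recovered mechanically from Table 4.2 by linking every pair whose identity exceeds a cut-off and reading off the connected groups. The sketch below is a Python illustration under stated assumptions, not part of the original analysis: the 40% cut-off is an arbitrary illustrative choice, only the pairs of Table 4.2 with at least 40% identity (BLOSUM50 column) are listed because all other pairs score 32% or less, and the names `group_by_similarity` and `HIGH_IDENTITY_PAIRS` are hypothetical.

```python
VIRUSES = ["HSV1", "HSV2", "VZV", "EBV", "HHV8", "HHV6", "HHV7", "HCMV"]

# Pairs of Table 4.2 with at least 40% identity (BLOSUM50 column).
HIGH_IDENTITY_PAIRS = {
    ("HSV1", "HSV2"): 94,
    ("HSV1", "VZV"): 52,
    ("HSV2", "VZV"): 52,
    ("EBV", "HHV8"): 56,
    ("HHV6", "HHV7"): 68,
    ("HHV6", "HCMV"): 44,
    ("HHV7", "HCMV"): 43,
}


def group_by_similarity(names, pairs, threshold):
    """Union-find grouping: two viruses share a group when a chain of pairs
    with identity >= threshold connects them."""
    parent = {name: name for name in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for (a, b), pct in pairs.items():
        if pct >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    # Order the groups by the position of their first member in `names`.
    return sorted(groups.values(), key=lambda group: names.index(group[0]))


categories = group_by_similarity(VIRUSES, HIGH_IDENTITY_PAIRS, threshold=40)
for number, members in enumerate(categories, start=1):
    print(number, members)
```

The grouping depends on the cut-off: a stricter value such as 60% keeps only the HSV1/HSV2 and HHV6/HHV7 pairs together and leaves the other viruses on their own.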
The differences in capsid gene activation and expression need to be studied in
depth by analyzing the promoter region [39]. In addition, the structural
similarity and evolutionary relationship between this virus family and genomic
retrotransposons, based on structure [40, 41] and activation [42, 43, 44, 45],
are currently under investigation.
These results have been accepted for publication under the title "Mathematical
Modeling And Classification Of Viruses From Herpesvirus Family" in the
International Journal of Computer Applications (IJCA).
CHAPTER FIVE
CONCLUSION AND OUTLOOK
5.1. Conclusion
The main contributions of this thesis are as follows:
I. The pairwise alignment comparison of HSF1 among the studied eukaryotic
organisms (e.g., Human, Taurus, Danio rerio, Mouse, Plant (Arabidopsis), Yeast)
shows that Human and Taurus are the closest over the entire length of HSF1
using the scoring matrices BLOSUM30 and BLOSUM50, as shown in Table 3.2.
However, the HSF1 conserved domain (HSF_DNA-bind) sequence is more similar
between Human and Mouse than between the other organisms, using the same BLOSUM
matrices, as shown in Table 3.1. Similar results are obtained using multiple
sequence alignment. As shown in Table 3.3, the results obtained by using the
Gene-Tracer algorithm confirm that the conserved domain (HSF_DNA-bind) in Mouse
is the same as in Human and that the entire sequence of the HSF1 protein in
Human is the same as in Taurus; they also show clearly the related parts
between the sequences.
II. One important result is that a comparison between the results in Table 3.1
and Table 3.2 shows that, for all the eukaryotic organisms we used, the degree
of similarity of the conserved domain (HSF_DNA-bind) is higher than the degree
of similarity of the entire sequence of the HSF1 protein.
III. From the results obtained in Figures 4.5 to 4.8 and Table 4.2, we can
classify the herpesvirus family according to the similarity of the structure of
their capsid protein into three categories as follows:
1. The first category contains
a) Herpes simplex virus 1 (HSV1)
b) Herpes simplex virus 2 (HSV2)
c) Varicella-zoster virus, VZV (HHV3)
2. The second category contains
a) Human herpesvirus 8 (HHV8)
b) Epstein-Barr virus (EBV)
3. The third category contains
a) Human herpesvirus 7 (HHV7)
b) Human herpesvirus 6 (HHV6)
c) Human cytomegalovirus, HCMV (HHV5)
Biologists have obtained the same classification, but according to criteria
other than the structural similarity of the capsid protein.
5.2. Outlook
The genetic evolution that occurs among the different types of HSF in human
(e.g., HSF1, HSF2, HSF4, and HSF5), and of HSF1 in eukaryotic organisms (e.g.,
Human, Taurus, Danio rerio, Mouse, Plant (Arabidopsis), Yeast), needs to be
studied.
Also, the differences in capsid gene activation and expression need to be
studied in depth by analyzing the promoter region. In addition, the structural
similarity and evolutionary relationship between this virus family and genomic
retrotransposons, based on structure and activation, are currently under
investigation.
REFERENCES
[1] Angelov, S. P. "Pattern Discovery in Biological Data Set", Ph.D. Thesis,
Pennsylvania University, 24-34, (2007).
[2] Wu, X. "Improving the Performance and Precision of Bioinformatics
Algorithms", Ph.D. Thesis, University of Maryland, 1-13, (2008).
[3] Sinha, S. "Algorithms for Finding Regulatory Motifs in DNA Sequences", Ph.D.
Thesis, Department of Computer Science and Engineering, University of
Washington, 2-16, (2002).
[4] Mona Singh, "Topics in computational molecular biology", lecture 2,
September 22, (1999).
[5] Gusfield, D. "Algorithms on Strings, Trees, and Sequences: Computer Science
and Computational Biology", Cambridge University Press, (1997).
[6] Richard C. Deonier, Simon Tavaré, Michael S. Waterman, "Computational Genome
Analysis: An Introduction", chapter 6, Springer, (2003).
[7] Nello Cristianini and Matthew W. Hahn, "Introduction to Computational
Genomics: A Case Studies Approach", chapter 3, Cambridge University Press,
(2006).
[8] M. O. Dayhoff, R. M. Schwartz, B. C. Orcutt, "A model of evolutionary change
in proteins", in Atlas of Protein Sequence and Structure, chapter 22, National
Biomedical Research Foundation, Washington, DC, pp. 345-358, (1978).
[9] S. Henikoff, J. G. Henikoff, "Amino acid substitution matrices from protein
blocks", Proc. Natl. Acad. Sci. USA, Vol. 89, No. 22, pp. 10915-10919, (1992).
[10] S. B. Needleman, C. D. Wunsch, "A general method applicable to the search
for similarities in the amino acid sequence of two proteins", Journal of
Molecular Biology, vol. 48, no. 1, pp. 443-453, (1970).
[11] X. Huang, K. M. Chao, "A generalized global alignment algorithm",
Bioinformatics, Vol. 19, No. 2, pp. 228-233, (2003).
[12] R. A. Cartwright, "Ngila: global pairwise alignments with logarithmic and
affine gap costs", Bioinformatics, Vol. 23, No. 11, pp. 1427-1428, (2007).
[13] T. F. Smith, M. S. Waterman, "Identification of common molecular
subsequences", J. Molecular Biology, no. 147, pp. 195-197, (1981).
[14] M. Eissa, A. M. Alzohairy, H. Abobakr, I. Zidan, "Gene-Tracer: Algorithm
Tracing Genes Modification from Ancestors through Offsprings", International
Journal of Computer Applications (0975-8887), Volume 52, No. 19, August (2012).
[15] Thompson J. D., Higgins D. G., Gibson T. J., "CLUSTAL W: improving the
sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice", Nucleic
Acids Res, vol. 22, pp. 4673-4680, (1994).
[16] Report "IUCN Red List of Threatened Species". Internet address:
http://www.iucnredlist.org/about/summary-statistics. (2010).
[17] Maddison, D. R. and K.-S. Schulz (eds.) "The Tree of Life Web Project".
Internet address: http://tolweb.org. (2007).
[18] S. B. Needleman, C. D. Wunsch, "A general method applicable to the search
for similarities in the amino acid sequence of two proteins", Journal of
Molecular Biology, vol. 48, no. 1, pp. 443-453, (1970).
[19] Sorger et al. "Stress-induced oligomerization and chromosomal
relocalization of heat-shock factor". Nature 353, 822-827 (31 October 1991);
doi:10.1038/353822a0. (1991).
[20] Morimoto, R. I. "Regulation of the heat shock transcriptional response:
cross talk between a family of heat shock factors, molecular chaperones, and
negative regulators". Genes Dev. 12, 3788-3896. (1998).
[21] Nover L, "Expression of heat shock genes in homologous and heterologous
systems". Enzyme Microb. Technol. 9, 130-144. (1987).
[22] Yokotani N, Ichikawa T, Kondou Y, Matsui M, Hirochika H, Iwabuchi M, Oda K,
"Expression of rice heat stress transcription factor OsHsfA2e enhances
tolerance to environmental stresses in transgenic Arabidopsis". Planta.
1432-2048. (2007).
[23] Mohammed M. Saleh, Ahmed M. Alzohairy, Osama Abdo Mohamed, Gaber H.
Alsayed, "A Comprehensive Study by Using Different Alignment Algorithms to
Demonstrate the Genetic Evolution of Heat Shock Factor 1 (HSF1) in Different
Eukaryotic Organisms". IRACST - Engineering Science and Technology: An
International Journal (ESTIJ), ISSN: 2250-3498, Vol. 3, No. 2, pp. 376-382,
(2013).
[24] Amon, Wolfgang; Farrell (November 2004). "Reactivation of Epstein-Barr
virus from latency". Reviews in Medical Virology 15 (3): 149-56.
doi:10.1002/rmv.456. PMID 15546128. Retrieved 28 May (2012).
[25] Benjamin, D. C., Kander, R. J., Volk, W. A. "Essential of Medical
Microbiology", 4th edition. J.B. Lippincott Company, Philadelphia. (1991).
[26] Richard A. Harvey, Pamela C. Champe, Bruce D. Fisher, "Lippincott's
Microbiology", 2nd Edition, Section 4, Chapter 25, Lippincott Williams &
Wilkins, (2007).
[27] Ryan KJ; Ray CG (editors) "Sherris Medical Microbiology" (4th ed.). McGraw
Hill. ISBN 0-8385-8529-9. (2004).
[28] Mettenleiter et al. "Molecular Biology of Animal Herpesviruses". Animal
Viruses: Molecular Biology. Caister Academic Press. ISBN 1-904455-22-0.
[http://www.horizonpress.com/avir], (2008).
[29] Sandri-Goldin RM (editor). "Alpha Herpesviruses: Molecular and Cellular
Biology". Caister Academic Press. ISBN 978-1-904455-09-7. (2006).
[30] Chayavichitsilp P, Buckwalter JV, Krakowski AC, Friedlander SF. "Herpes
simplex". Pediatr Rev 30 (4): 119-29; quiz 130. doi:10.1542/pir.30-4-119. PMID
19339385. April (2009).
[31] In the United States, as many as 95% of adults between 35 and 40 years of
age have been infected. National Center for Infectious Diseases.
http://www.cdc.gov/ncidod/diseases/ebv.htm
[32] Staras SA, Dollard SC, Radford KW, Flanders WD, Pass RF, Cannon MJ
(November 2006). "Seroprevalence of cytomegalovirus infection in the United
States, 1988-1994". Clin. Infect. Dis. 43 (9): 1143-51. doi:10.1086/508173.
PMID 17029132. Retrieved (2009).
[33] John Carter, Venetia Saunders. "Virology, Principles and Applications".
John Wiley & Sons. ISBN 978-0-470-02386-0.
[34] Jay C. Brown, William W. Newcomb. "Herpesvirus Capsid Assembly: Insights
from Structural Analysis". Current Opinion in Virology 1 (2): 142-149. (2011).
[35] Madigan M, Martinko J. "Brock Biology of Microorganisms" (11th ed.).
Prentice Hall. ISBN 0-13-144329-1, (2006).
[36] N. Komarova, D. Wodarz, "ODE models for oncolytic virus dynamics", J of
Theoretical Biology, 263, 530-543, (2010).
[37] Y Wang, JP Tian, J Wei. "Lytic cycle: A defining process in oncolytic
virotherapy". Applied Mathematical Modelling, Vol. 37, Issue 8, pp. 5962-5978,
15 April (2013).
[38] Gaber H. Alsayed, Ahmed M. Alzohairy, Osama Abdo Mohamed, Mohamed M. Saleh.
"Mathematical Modeling And Classification Of Viruses From Herpesvirus Family".
Computers in Biology and Medicine: An International Journal, (2014).
[39] Alzohairy, A. Mansour, Margaret H. MacDonald, Benjamin F. Matthews. "The
pJan25 vector series: An enhancement of the gateway-compatible vector pGWB533
for broader promoter testing applications". Plasmid, 69(3): 249-56, (2013).
[40] Alzohairy, A. Mansour, Gábor Gyulai, Jansen RK, A. Bahieldin. "Transposable
Elements Domesticated and Neofunctionalized by Eukaryotic Genomes". Plasmid,
69: 1-15, (2013).
[41] Mansour, A. "Utilization of Genomic Retrotransposon as cladistic molecular
markers". Journal of Cell and Molecular Biology. 7(1): 17-28, (2008).
[42] Mansour, A. "Epigenetic activation of Genomic Retrotransposon". Journal of
Cell and Molecular Biology. 6(2): 99-107, (2007).
[43] Mansour, A. "Water Deficit Induction of Copia and Gypsy Genomic
Retrotransposons". Plant Stress 3(1): 33-39, (2009).
[44] Alzohairy, A. Mansour, Mohamed A Yousef, Sherif S Edris, Balázs Kerti and
Gábor Gyulai, "Detection of long terminal repeat (LTR) retrotransposons
reactivation induced by in vitro environmental stresses in barley (Hordeum
vulgare) via reverse transcription-quantitative polymerase chain reaction
(RT-qPCR)". Life Science Journal, 9(4): 5019-5026, (2012).
[45] Ahmed Alzohairy, J.S.M. Sabir, Gabor Gyulai, Rania Younis, Robert Jansen,
Ahmed Bahieldin, "Environmental Stress Activation of Plant
LTR-Retrotransposons". Functional Plant Biology. (In press 2014).