Zagazig University
Faculty of Science
Mathematics Department

USING SEQUENCE ALIGNMENT ALGORITHMS IN SOME APPLICATIONS IN BIOINFORMATICS

A Thesis Submitted to the Department of Mathematics, Faculty of Science, Zagazig University, Egypt, in Partial Fulfillment of the Requirements for the Award of the M.Sc. Degree (Mathematics and Computer Science)

By
Gaber Hassan Al-sayed Ahmed Abdelaal
B.Sc. (Mathematics and Computer Science), 2007, Zagazig University, Egypt
Teaching Assistant at the Basic Science Department, Faculty of Engineering, Sinai University, Egypt

2014

In the name of Allah, the Most Gracious, the Most Merciful
"My success comes only through Allah. In Him I trust, and to Him I turn."
Allah the Almighty has spoken the truth (Surah Hud, verse 88)

In the name of Allah, the Most Gracious, the Most Merciful
"And you have been given only a little knowledge."
Allah the Almighty has spoken the truth (Surah Al-Isra, verse 85)

ACKNOWLEDGMENT

First of all, all praise is due to ALLAH AL-WAHAB, who guided and aided me in bringing forth this thesis; without His help it would not have been possible.

I would like to take this opportunity to thank all those who helped me and contributed in many different ways. To begin with, it is fitting to acknowledge Dr Ahmed Mansour Alzohairy, Dr Osama Abdo Mohamed, and Dr Mohamed Hussein Saleh. It has been a pleasure to work with them, and their expertise and guidance have proven invaluable throughout this thesis. Without their assistance, valuable discussions, and critical reading of the manuscript, this work could not have been accomplished. They taught me a great deal in my field of study and showed me how the problems could be solved.

Furthermore, I would like to express my thanks to all members of the Department of Mathematics, Faculty of Science, Zagazig University, for their constant help.

Finally, I wish to express my deep thanks and gratitude to my parents and my sisters for their encouragement during the preparation of this thesis.

Gaber H. Alsayed (2014)

PUBLICATIONS
1.
A Comprehensive Study by Using Different Alignment Algorithms to Demonstrate the Genetic Evolution of Heat Shock Factor 1 (HSF1) in Different Eukaryotic Organisms, published in IRACST - Engineering Science and Technology: An International Journal (ESTIJ), 3(2), (April 2013): 376.
2. Mathematical Modeling and Classification of Viruses from Herpesvirus Family, published in International Journal of Computer Applications (IJCA), Volume 87, No. 12, February 2014.

ABSTRACT

Bioinformatics is a multidisciplinary science that applies computational methods and mathematical statistics to molecular biology. Bioinformatics is also called:
1) Computational Biology (USA)
2) Computational Molecular Biology
3) Computational Genomics

Studying bioinformatics is important because it offers an opportunity to use some of the most interesting computational techniques to understand some of the deep mysteries of life and disease and, hopefully, to contribute to curing some of the diseases that affect people. Bioinformatics combines algorithms from computer science with statistics to analyze, understand, and form hypotheses about the large repositories of collected biological data and knowledge.

One of the most important topics in bioinformatics is sequence alignment, or sequence comparison, which is the subject of this thesis. In the comparison of biomolecular sequences (i.e., those of DNA, RNA, and protein), regions of high sequence similarity often indicate significant functional or structural similarity as well. The same and related molecular structures and mechanisms are reused and modified during evolution, and thus show up repeatedly within a single genome or across the genomes of a wide variety of species. As a result, sequence comparison is the most commonly used method for inferring structure and biological function. Of course, sequences can have similar structure and function without exhibiting sequence similarity.
Sequence comparison is also the first step for many problems in computational biology, such as evolutionary tree reconstruction, genome analysis, and the classification of viruses that belong to a specific family. One of the most important ways to understand the life cycle of viruses is to construct a mathematical model that describes the six steps of the lytic cycle of a virus.

CONTENTS

Acknowledgment
Publications
Summary
Contents
List of abbreviations
List of figures
List of tables

CHAPTER ONE: FUNDAMENTAL CONCEPTS
Biological Background
Introduction
Bioinformatics cycle
Nucleic Acids
DNA
RNA
Protein
Sequence Analysis
Sequence Alignment
Pairwise Sequence Alignment
Multiple Sequence Alignment
Basic Definitions on String
Motivations
Objectives

CHAPTER TWO: SEQUENCE ALIGNMENT ALGORITHMS
Pairwise Sequence Alignment by Dynamic Programming
Global Sequence Alignment
Global sequence alignment algorithm (Needleman-Wunsch algorithm)
Local sequence alignment
Local sequence alignment algorithm (Smith-Waterman algorithm)
Gene Tracer algorithm
Multiple Sequence Alignment algorithms

CHAPTER THREE: USING SEQUENCE ALIGNMENT ALGORITHMS TO DEMONSTRATE THE GENETIC EVOLUTION OF HEAT SHOCK FACTOR 1 (HSF1) IN DIFFERENT EUKARYOTIC ORGANISMS
Overview
Introduction
Heat shock factor protein
Biological data used in our work
Results and discussion
Pairwise global alignment among the common conserved domains (HSF_DNA-bind) and also among the entire sequence of HSF1 protein
Multiple sequence alignment (MSA) among all the conserved domains (HSF_DNA-bind) in HSF1 protein sequences and also among all the entire sequences of HSF1 protein
Using Gene Tracer algorithm to specify some results obtained in table 3.1 and table 3.2
Conclusion

CHAPTER FOUR: MATHEMATICAL MODELING AND CLASSIFICATION OF VIRUSES FROM HERPESVIRUS FAMILY
Overview
Introduction to viruses
Enveloped DNA viruses
Herpesviridae (herpesvirus) family
Biological data used in our work
Mathematical modeling of herpesvirus family life cycle
Results and discussion
Pairwise global alignment among the entire sequences of herpesvirus family
Multiple sequence alignment (MSA) among the entire sequences of the capsid protein of herpesvirus family
Conclusion

CHAPTER FIVE: CONCLUSION AND OUTLOOK
Conclusion
Outlook
References

LIST OF ABBREVIATIONS

Deoxyribonucleic Acid                            DNA
Ribonucleic Acid                                 RNA
Ribosomal Ribonucleic Acid                       rRNA
Messenger Ribonucleic Acid                       mRNA
Transfer Ribonucleic Acid                        tRNA
Adenine                                          A
Guanine                                          G
Cytosine                                         C
Thymine                                          T
Uracil                                           U
Dynamic Programming                              DP
Needleman-Wunsch                                 NW
Smith-Waterman                                   SW
Multiple Sequence Alignment                      MSA
Blocks Substitution Matrix                       BLOSUM
Percent Accepted Mutation                        PAM
Heat Shock                                       HS
Heat Shock Factor                                HSF
National Center for Biotechnology Information    NCBI
UniProt Knowledgebase                            UniProtKB
Herpes simplex virus 1                           HSV1
Herpes simplex virus 2                           HSV2
Epstein-Barr virus                               EBV
Human herpesvirus 6                              HHV6
Human herpesvirus 7                              HHV7
Human herpesvirus 8                              HHV8
Varicella-zoster virus                           VZV (HHV3)
Human cytomegalovirus                            HCMV (HHV5)

LIST OF FIGURES

Fig. 1.1   The use and development of mathematical algorithms and computer programs to obtain insight into biological and medical systems
Fig. 1.2   Bioinformatics cycle
Fig. 1.3   DNA Structure
Fig. 1.4   RNA Structure
Fig. 1.5   Different Graphical Representations of Proteins
Fig. 1.6   From DNA to Protein (central dogma of molecular biology)
Fig. 1.7   Dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac
Fig. 1.8   Filtered dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac
Fig. 2.1   Matrix sim for scoring alignment of s = AAAC and t = AGC
Fig. 2.2   Local Alignment Example
Fig. 2.3   Gene Tracer Function
Fig. 3.1   Structure of Heat Shock Factor Protein 1 DNA Binding Domain from Homo sapiens
Fig. 3.2   Results of sequence dot plot matrix among the conserved domains (HSF_DNA-bind) in HSF1 protein sequences
Fig. 3.3   Results of sequence dot plot matrix among the entire sequences of HSF1 protein
Fig. 3.4   Graphical summary of the conserved domains of HSF1 protein in the different organisms human, Danio rerio, Taurus, mouse, yeast, and plant
Fig. 3.5   Similarity between the conserved domains in HSF1 protein sequences in all the above species in case of MSA by using the scoring matrix BLOSUM60
Fig. 3.6   Similarity between the conserved domains in HSF1 protein sequences in all the above species in case of MSA by using the scoring matrix BLOSUM80
Fig. 3.7   Similarity between the conserved domains in HSF1 protein sequences in all the above species in case of MSA by using the scoring matrix PAM10
Fig. 3.8   Similarity between all sequences of the above species in case of HSF1 protein sequence by using the scoring matrix BLOSUM60
Fig. 3.9   Similarity between all sequences of the above species in case of HSF1 protein sequence by using the scoring matrix BLOSUM80
Fig. 3.10  Similarity between all sequences of the above species in case of HSF1 protein sequence by using the scoring matrix PAM1
Fig. 3.11  Results of Gene Tracer program in the case of the conserved domain sequence (HSF_DNA-bind)
Fig. 3.12  Results of Gene Tracer program in the case of the entire sequence of HSF1 protein sequence
Fig. 4.1   Virus Structure components
Fig. 4.2   General DNA Virus structure
Fig. 4.3   Simple diagram representing the herpesviruses lytic cycle
Fig. 4.4   The solution to system of equations produced by ODE45
Fig. 4.5   Results of sequence dot plot matrix among the entire sequences of their capsid protein
Fig. 4.6   Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix BLOSUM60
Fig. 4.7   Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix BLOSUM30
Fig. 4.8   Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix PAM10
Fig. 4.9   Classification of herpesvirus family according to the structure of their capsid protein

LIST OF TABLES

Table 2.1  A simple scoring function
Table 3.1  Results of pairwise global alignment in case of the conserved domains (HSF_DNA-bind)
Table 3.2  Results of pairwise global alignment in case of the entire length of the HSF1 sequences
Table 3.3  Results of gene-tracer algorithm
Table 4.1  The modeling steps symbols
Table 4.2  Results of pairwise global alignment of the entire length of herpesvirus family sequences

CHAPTER ONE
FUNDAMENTAL CONCEPTS

In this chapter we introduce the fundamental concepts needed to study the sequence alignment problem in bioinformatics. The chapter consists of six sections. In Section 1.1 we review the fundamental facts of molecular biology. In Section 1.2 we give a brief description of the sequence analysis problem. In Section 1.3 we review the fundamentals of sequence alignment (pairwise and multiple sequence alignment). In Section 1.4 we give some basic definitions on sequences and strings. Finally, Sections 1.5 and 1.6 present the motivations for studying sequence alignment and the objectives of this thesis, respectively.

1.1. Biological Background

1.1.1. Introduction
Bioinformatics combines several other sciences, such as biology, statistics, computer science, and mathematics, to process biological data, as shown in figure 1.1. Databases and information systems are used to store and organize biological data. Analyzing biological data may involve algorithms from artificial intelligence, soft computing, data mining, image processing, and simulation. Biological data may be nucleic acid sequences (deoxyribonucleic acid, "DNA", or ribonucleic acid, "RNA") or protein sequences.

Figure: 1.1.
The use and development of mathematical algorithms and computer programs to obtain insight into biological and medical systems.

1.1.2. Bioinformatics cycle
The bioinformatics cycle consists of three stages, as shown in figure 1.2. In the first stage, biological data, which represent information, are extracted from biological experiments in the life sciences. In the second stage, the biological data are stored in a biological database. In the third stage, several operations may be performed on the biological sequences retrieved from the database, such as statistical analysis, visualization, prediction, and modeling using information science techniques. The ultimate goal of statistical bioinformatics is to identify statistically significant changes in biological processes in order to answer biological questions. For example, if a biologist needs to determine the genetic evolution of a specific gene across different species, sequence alignment algorithms can be used for this purpose, as shown in chapter three.

Figure: 1.2. Bioinformatics cycle.

1.1.3. Nucleic Acids
DNA and RNA are collectively referred to as nucleic acids, a name that reflects their occurrence in the nuclei of eukaryotic cells. A nucleic acid is composed of chains of nucleotides. In DNA there are two chains (strands) forming a double helix (see Figure 1.3), while in RNA there is only one strand (see Figure 1.4) [1]. The building unit of both DNA and RNA is the nucleotide. Each nucleotide consists of three components: a nitrogenous base (nucleobase), a pentose sugar (deoxyribose in DNA and ribose in RNA), and a phosphate group, as shown in figure 1.3. Special attention will be given to the nucleobase component. There are two kinds of nitrogenous bases: purines and pyrimidines. Two types of purines and three types of pyrimidines are commonly found in nucleic acids. The two purines are adenine and guanine, abbreviated A and G.
The three pyrimidines are cytosine, thymine, and uracil, abbreviated C, T, and U. Both DNA and RNA contain A, C, and G; only DNA contains the base T, whereas only RNA contains the base U.

1.1.3.1. DNA
DNA contains the basic genetic information that each organism needs to live and reproduce. In DNA, hydrogen bonds always pair adenine (A) with thymine (T) and cytosine (C) with guanine (G). The complete set of DNA that contains the entire hereditary information of an organism is called the genome of the organism [2]. The nucleotides of a DNA sequence can be roughly divided into two groups. The first group consists of all nucleotides that build the genes, and the second group is the junk DNA, that is, nucleotides that have no function or an unknown function.

Figure: 1.3. DNA Structure.

1.1.3.2. RNA
The structure of RNA is similar to that of DNA, with two important differences. First, DNA consists of two chains (strands) forming a double helix, whereas RNA consists of a single chain. Second, RNA contains uracil (U) in place of thymine (T), so adenine (A) pairs with uracil (U) rather than with thymine (T). An RNA chain therefore consists of the four bases adenine (A), uracil (U), cytosine (C), and guanine (G).

Figure: 1.4. RNA Structure.

1.1.4. Protein
A protein is a polypeptide chain, a polymer formed from twenty different kinds of amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y). The molecule exists in a specifically folded structure that is necessary for its function. Thousands of types of proteins are present in the different cells of an organism, each with a distinct amino acid sequence and 3-dimensional structure (see Figure 1.5) [3].

Figure: 1.5. Different Graphical Representations of Proteins.

The process of generating protein from DNA is called gene expression. It occurs in two major steps.
The first is transcription, where the information coded in DNA is copied into a molecule of RNA whose bases are complementary to those of the DNA. The second is translation, where the information now encoded in RNA is translated into instructions for manufacturing a protein using the ribosome. This process is shown in figure 1.6.

Figure: 1.6. From DNA to Protein (central dogma of molecular biology)

1.2. Sequence Analysis
One of the most important operations in bioinformatics is sequence analysis, which means analyzing DNA or protein sequences to understand their features, functions, and structures. Sequence alignment, or sequence comparison, is one of the most important methods for performing sequence analysis.

1.3. Sequence Alignment
First we need to define what we mean by sequence alignment. By writing the sequences of two genes as strings of characters, with one string above the other, we can determine at which positions the strings do or do not match. This process is called alignment or sequence alignment. Sequence alignment involves establishing correspondences between bases of DNA or RNA strings or between amino acids forming linear sequences in proteins. Aligning DNA, RNA, or amino acid sequences is of basic importance in bioinformatics and can be used for a variety of research purposes. It can reveal similarity between two DNA sequences that results from a recent common ancestor from which the two sequences originate. The process of aligning two sequences is called pairwise alignment. By measuring or computing distances between the aligned sequences, one draws inferences about the evolutionary processes they have gone through. This inference may involve estimating the time that has passed from the common ancestor to the present, but may also involve stating hypotheses about, or reconstructing, a single evolutionary event in the past or a sequence of such events.
Aligning two sequences can reveal their overlap, show that one sequence is part of the other, or show that the two sequences share a subsequence. Instead of two sequences, one can also align many sequences at once, a process known as multiple sequence alignment (MSA), or match a sequence against a DNA, RNA, or protein database.

1.3.1. Pairwise Alignment
Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The four primary methods of producing pairwise alignments are:

1. By-hand methods (sliding the sequences along two lines of a word processor).

2. The dot-matrix or dot plot method, which implicitly produces a family of alignments for individual sequence regions. The dot matrix is a simple and very useful concept for aligning two sequences. Assume that the sequences to be aligned are s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac. Form a rectangular n × m matrix, where n and m are the lengths of s1 and s2, with rows corresponding to the characters of the first string s1 and columns corresponding to the characters of the second string s2, so that the characters run down and to the right in order. Place a dot in each matrix entry where a base from s1 matches a base from s2. The result, shown in Fig 1.7, is called a dot matrix.

Figure: 1.7. Dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac. The dots show possible correspondences between the characters of the strings s1 and s2.

In a dot plot, each diagonal corresponds to a possible (ungapped) alignment. There are many dots that result from accidental matches between letters of the two strings.
We can eliminate some of these by removing dots that are unlikely to represent a nonrandom correspondence between characters of s1 and s2, using some intuitive criterion. If we require that a dot is kept only when it belongs to a run of at least k neighboring matches along the right-down diagonal direction, some of the random accidental matches are filtered out. If k is too small, many accidental matches remain in the dot matrix plot. On the other hand, if k is too large, some of the true correspondences between the strings may be unintentionally omitted. With k = 3 we obtain the filtered dot matrix shown in Fig 1.8, which is much easier to interpret than the original one.

Figure: 1.8. Filtered dot matrix comparison of sequences s1 = tgagaaaatgctttagcacggctgg and s2 = aaatgctttgagcac. The dots are now arranged in diagonal paths, which show more clearly the possible correspondences between the characters of the strings s1 and s2.

From the filtered dot matrix, we can construct the following alignment between s1 and s2:

s1 = tgagaaaatgcttt-agcacggctgg
          ::::::::: :::::
s2 =      aaatgctttgagcac

Using dot matrices is rather intuitive, since the alignment is performed by following long lines of dots in the plot. Nevertheless, there is a scoring system behind it. For example, we may assign a score of 1 to every single match between letters of the strings, and we should not introduce indels unless doing so yields a large enough number of new matches. We should also penalize correspondences between mismatching symbols.

3. Rigorous mathematical methods (dynamic programming), which are slow but optimal. This approach uses two important algorithms: the Needleman-Wunsch algorithm (for global pairwise alignment) and the Smith-Waterman algorithm (for local pairwise alignment). These two algorithms are discussed in detail in chapter two.

4.
Heuristic algorithms (faster but approximate):
   a. BLAST
   b. FASTA

1.3.2. Multiple Sequence Alignment
Multiple sequence alignment is an extension of pairwise alignment that incorporates more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used to identify conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. Alignments are also used to help establish evolutionary relationships by constructing phylogenetic trees.

1.4. Basic Definitions on String
As discussed earlier, DNA, RNA, and protein are represented as strings in bioinformatics. Therefore, in this section we give some basic definitions that are needed in string processing [4].

1. Alphabet: the set of allowable symbols. Examples of biosequence alphabets:
   Σ = {A, C, G, T} (DNA)
   Σ = {A, C, G, U} (RNA)
   Σ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y} (Proteins)

2. Sequence (or string): a finite succession of characters chosen from an alphabet, e.g., ATCCGAACTTG from the DNA alphabet Σ = {A, C, G, T}.

3. Subsequence: a sequence obtained from a sequence by removing characters, e.g.,
   TTT is a subsequence of ATATAT
   AAAA is not a subsequence of ATATAT

4. Substring: a subsequence of consecutive characters, e.g.,
   TAC is a substring of ACTACA
   TAC is not a substring of ATGAC

5. Length of a string [5]: the number of characters in the string. It can be any non-negative integer. For example, if Σ = {A, C, G, T}, then ACGATGGGT is a string over Σ of length 9.

6. Prefix: a substring containing the first character of a string; the empty string also counts as a prefix. For example, the prefixes of the string ACT are the empty string, A, AC, and ACT.

7. Suffix: a substring containing the last character of a string; the empty string also counts as a suffix. For example, the suffixes of the string TAC are the empty string, C, AC, and TAC.
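The definitions above can be checked mechanically. The following is a minimal Python sketch, not part of the thesis software; the function names is_subsequence, is_substring, prefixes, and suffixes are illustrative choices made here, and the example strings are the ones used in the definitions.

```python
def is_subsequence(candidate: str, text: str) -> bool:
    """Return True if candidate can be obtained from text by deleting characters."""
    remaining = iter(text)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so the characters of candidate must appear in text in the same order.
    return all(ch in remaining for ch in candidate)


def is_substring(candidate: str, text: str) -> bool:
    """Return True if candidate occurs in text as consecutive characters."""
    return candidate in text


def prefixes(text: str) -> list[str]:
    """All prefixes of text, from the empty string up to text itself."""
    return [text[:k] for k in range(len(text) + 1)]


def suffixes(text: str) -> list[str]:
    """All suffixes of text, from text itself down to the empty string."""
    return [text[k:] for k in range(len(text) + 1)]


if __name__ == "__main__":
    print(is_subsequence("TTT", "ATATAT"), is_subsequence("AAAA", "ATATAT"))
    print(is_substring("TAC", "ACTACA"), is_substring("TAC", "ATGAC"))
    print(len("ACGATGGGT"), prefixes("ACT"), suffixes("TAC"))
```

Note that every substring is also a subsequence, but not the other way round: TTT is a subsequence of ATATAT without being a substring of it.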
1.5. Motivations
Our motivation for studying sequence comparison is its importance in molecular biology. These comparisons are important for a number of reasons [6]:
1. They can be used to establish evolutionary relationships among different organisms.
2. They may allow prediction of protein function and structure.
3. Comparisons between humans and other species may identify corresponding genes in model organisms, which can be genetically manipulated to develop models for human diseases.

1.6. Objectives
The objective of this thesis is to use sequence alignment algorithms in some applications in molecular biology. The first application demonstrates the genetic evolution of a protein called heat shock factor 1 (HSF1), and of the conserved domains in this protein, across different eukaryotic organisms. The second classifies the viruses that belong to the herpesvirus family, one of the most important families of enveloped DNA viruses because it contains many viruses that are dangerous to human health. A mathematical model of the general life cycle of herpesviruses is also introduced.

CHAPTER TWO
SEQUENCE ALIGNMENT ALGORITHMS

In this chapter we explain in detail the algorithms used in our applications. The chapter consists of three sections. In Section 2.1 we discuss in detail, with examples, how to perform pairwise sequence alignment (global and local) by dynamic programming, using the Needleman-Wunsch algorithm (NW) for global alignment and the Smith-Waterman algorithm (SW) for local alignment. Section 2.2 discusses an algorithm based on the Smith-Waterman algorithm, called the Gene Tracer algorithm. Finally, in Section 2.3 we discuss multiple sequence alignment and the CLUSTALW algorithm.

2.1.
Pairwise Sequence Alignment by Dynamic Programming
As mentioned in section 1.3.1, there is more than one method for performing pairwise sequence alignment. In this thesis we focus on dynamic programming (DP) methods, which take longer to execute than heuristic methods but give an optimal alignment. In mathematics, computer science, and economics, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. The idea behind dynamic programming is to solve these subproblems and then combine their solutions to reach the overall solution. Often, many of these subproblems are really the same. The dynamic programming approach seeks to solve each subproblem only once, thus reducing the number of computations: once the solution to a given subproblem has been computed, it is stored or "memoized", and the next time the same solution is needed it is simply looked up. This approach is especially useful when the number of repeated subproblems grows exponentially as a function of the size of the input. There are two important algorithms used in DP pairwise sequence alignment:
1. The Needleman-Wunsch algorithm (for pairwise global sequence alignment).
2. The Smith-Waterman algorithm (for pairwise local sequence alignment).

2.1.1. Global Sequence Alignment
Pairwise global sequence alignment is the process of aligning two sequences over their entire length. The global alignment problem is to find the best alignment according to some scoring function that measures similarity. Given two strings of different lengths, an alignment makes the sequences the same length through the insertion of gaps. Gaps may be added anywhere within the sequences, including at the beginning or the end, but a gap cannot be aligned with another gap. As an example, take the two strings CGACCTA and CGCCTA. One alignment is as follows:

CGACCTA
CG-CCTA

Here we inserted a space (or gap) to generate a good alignment.
The simple scoring function in Table 2.1 can be used to calculate the total score of this alignment.

Column      Score
Match       +1
Mismatch    -1
Gap         -2

Table 2.1: A simple scoring function

In this simple scoring function, a column containing two identical characters (a match) receives a score of +1, a column containing two different characters (a mismatch) receives a score of -1, and a column containing a gap receives a score of -2. This simple scoring function can be used to form what is known as a substitution matrix. A substitution matrix is a convenient way of representing many scoring functions. In general, it gives the cost of replacing one letter (of either a nucleotide or an amino acid alphabet) with another letter or with a gap. Substitution matrices can be represented without the gap character, but if gaps are needed in the alignment process we can include a row and a column for them. According to the simple scoring function in Table 2.1, we can form the following 5 x 5 substitution matrix for this example:

       A     C     G     T     -
A     +1    -1    -1    -1    -2
C     -1    +1    -1    -1    -2
G     -1    -1    +1    -1    -2
T     -1    -1    -1    +1    -2
-     -2    -2    -2    -2    N/D

Note that the value of aligning a gap with a gap is not defined (N/D), because we never need to align gaps with each other in this example.

Using the simple scoring function, the total score of the previous example is
1 + 1 - 2 + 1 + 1 + 1 + 1 = 4
But in the following example the total score is
1 + 1 - 2 - 1 + 1 + 1 = 1

C G A C C T
C G - C C T
1 1 -1 1 1 -2

The optimal global alignment of two strings is defined as the alignment that maximizes the total alignment score over all possible alignments [7]. In principle it can be found through the following steps:
1. Enumerate all possible alignments.
2. Score each alignment.
3. Choose the alignment with the highest score.
The problem with this approach is that the number of possible alignments is prohibitively large (exponential in the length of the sequences).
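The column-by-column scoring described in this section can be sketched in a few lines of Python. This is only an illustration under the simple scoring function of Table 2.1; the names MATCH, MISMATCH, GAP, column_score, and alignment_score are chosen here and do not come from the thesis software, and "-" is assumed as the gap character.

```python
MATCH, MISMATCH, GAP = 1, -1, -2  # simple scoring function of Table 2.1


def column_score(a: str, b: str) -> int:
    """Score one alignment column; '-' denotes a gap."""
    if a == "-" and b == "-":
        # The substitution matrix leaves gap-versus-gap undefined (N/D).
        raise ValueError("a gap cannot be aligned with another gap")
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def alignment_score(row_s: str, row_t: str) -> int:
    """Total score of an alignment given as two gapped rows of equal length."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(column_score(a, b) for a, b in zip(row_s, row_t))


if __name__ == "__main__":
    # The first example of this section: 1 + 1 - 2 + 1 + 1 + 1 + 1 = 4
    print(alignment_score("CGACCTA", "CG-CCTA"))
```

Scoring a single alignment in this way is cheap; what makes the enumerate-and-score approach impractical is the number of candidate alignments that would have to be passed to such a function.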
Note that because nucleotides differ very little in biochemical function, simple scoring functions are often used for DNA alignment. Amino acids, on the other hand, can be quite different from one another, and mismatches can have varied effects depending on how similar or dissimilar the amino acids are in their biochemical properties. Scores based on inferences about the chemical or physical properties of proteins are possible and useful. It is well known that certain pairs of amino acids are much more likely to substitute for each other during evolution than others. This is likely due to physicochemical properties that they have in common, such as hydrophobicity, size, or electrical charge. A good alignment should take this into account and incorporate it into the scoring function, so that the overall alignment reflects the biological similarity between sequences more closely. The two most common scoring functions that do this are based on observed substitution frequencies in proteins, and are called PAM (Percent Accepted Mutations) matrices [8] and BLOSUM (Blocks Substitution Matrix) matrices [9].

Fortunately, there is an algorithm based on dynamic programming, the Needleman-Wunsch algorithm, that computes the best alignment in O(mn) time, where m and n are the lengths of the two sequences. We first demonstrate the Needleman-Wunsch algorithm through an example.

Example: Suppose we want to know the score of the best alignment of s = AAAC and t = AGC using our simple scoring function.
Notation: let s(i) and t(j) denote the ith and jth characters of s and t, respectively.
Considering just the last column of the alignment, we have only three possibilities:
1. The last character of s (C) is aligned with the last character of t (C).
In this case the score of the best alignment of s and t is equal to the score of the best alignment of AAA (the remaining portion of s) and AG (the remaining portion of t), plus 1 for matching the last character.
2. The last character of s (C) is aligned with a gap. In this case the score of the best alignment of s and t is equal to the score of the best alignment of AAA (the remaining portion of s) and AGC (all of t), minus 2 for inserting a gap.
3. The last character of t (C) is aligned with a gap. In this case the score of the best alignment of s and t is equal to the score of the best alignment of AAAC (all of s) and AG (the remaining portion of t), minus 2 for inserting a gap.

If we know the answers to the three subproblems mentioned above, then we know the score of the best alignment between s and t. Note that the subproblems consist of aligning prefixes of s and t. We will find and save optimal solutions for all prefixes of s and t, building up from shorter ones to longer ones. There are five prefixes of s: empty, A, AA, AAA, and AAAC, and we will refer to these as the 0th, 1st, 2nd, 3rd, and 4th prefixes of s. Likewise, there are four prefixes of t: empty, A, AG, and AGC. The algorithm uses a matrix representation (in this case a 5 x 4 matrix), with the characters of s along the rows and the characters of t along the columns, as shown in Figure 2.1.

Figure: 2.1. Matrix sim for scoring alignment of s = AAAC and t = AGC.

We define sim(i, j) to be the optimal alignment score (the "similarity") of the ith prefix of s with the jth prefix of t. Thus, the matrix holds the similarity scores for all prefixes of s and t. In general, when we are aligning a sequence s of length m and a sequence t of length n, sim(m, n) holds the answer we are looking for, and we fill out the entire matrix in order to obtain this last score.
Each element is determined by our sim function, which takes the following form:

\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i-1,j-1) \pm 1, & \text{align } s(i) \text{ with } t(j) \text{ (+1 for a match, -1 for a mismatch)} \\
\mathrm{sim}(i-1,j) - 2, & \text{align } s(i) \text{ with a gap} \\
\mathrm{sim}(i,j-1) - 2, & \text{align } t(j) \text{ with a gap}
\end{cases}
\]

In order to compute sim(i, j), we need three entries to be precomputed: sim(i-1, j-1), sim(i-1, j), and sim(i, j-1). If we compute the entries row by row, from left to right, these values are always available when we need them. We start by filling the 0th row and column. Using our scoring function, aligning a single-letter string with a gap results in a score of -2; similarly, aligning a two-letter string with gaps results in a score of -4. In general, aligning a string of i letters with gaps gives a score of -2i, and the 0th row and column may thus be filled in accordingly (see Figure 2.1). Note that aligning a gap with a gap contributes nothing to the alignment of the strings, so sim(0, 0) is zero.

Now we have all the information we need to evaluate array element (1, 1). sim(1, 1) is the alignment of A with A, and according to the function:

\[
\mathrm{sim}(1,1) = \max
\begin{cases}
\mathrm{sim}(0,0) + 1 = 0 + 1 = 1, & \text{align } s(1) \text{ with } t(1) \\
\mathrm{sim}(0,1) - 2 = -2 - 2 = -4, & \text{align } s(1) \text{ with a gap} \\
\mathrm{sim}(1,0) - 2 = -2 - 2 = -4, & \text{align } t(1) \text{ with a gap}
\end{cases}
\]

The maximum comes from sim(0, 0) and evaluates to 1. We place 1 in the array element and, since the maximum came from element (0, 0), we keep track of this by "drawing" an arrow pointing to that array element (see Figure 2.1). We now have the information required to evaluate sim(1, 2) in the same manner:

\[
\mathrm{sim}(1,2) = \max
\begin{cases}
\mathrm{sim}(0,1) - 1 = -2 - 1 = -3, & \text{align } s(1) \text{ with } t(2) \\
\mathrm{sim}(0,2) - 2 = -4 - 2 = -6, & \text{align } s(1) \text{ with a gap} \\
\mathrm{sim}(1,1) - 2 = 1 - 2 = -1, & \text{align } t(2) \text{ with a gap}
\end{cases}
\]

The maximum is -1 (fill this value into the matrix) and it came from element (1, 1), so draw an arrow pointing to the left, toward element (1, 1).
The array element sim(1, 3) can be obtained as follows:

\[
\mathrm{sim}(1,3) = \max
\begin{cases}
\mathrm{sim}(0,2) - 1 = -4 - 1 = -5, & \text{align } s(1) \text{ with } t(3) \\
\mathrm{sim}(0,3) - 2 = -6 - 2 = -8, & \text{align } s(1) \text{ with a gap} \\
\mathrm{sim}(1,2) - 2 = -1 - 2 = -3, & \text{align } t(3) \text{ with a gap}
\end{cases}
\]

The maximum is -3 (fill this value into the matrix) and it came from element (1, 2), so draw an arrow pointing to the left, toward element (1, 2). Similarly, for sim(2, 1):

\[
\mathrm{sim}(2,1) = \max
\begin{cases}
\mathrm{sim}(1,0) + 1 = -2 + 1 = -1, & \text{align } s(2) \text{ with } t(1) \\
\mathrm{sim}(1,1) - 2 = 1 - 2 = -1, & \text{align } s(2) \text{ with a gap} \\
\mathrm{sim}(2,0) - 2 = -4 - 2 = -6, & \text{align } t(1) \text{ with a gap}
\end{cases}
\]

The maximum is -1 (fill this value into the matrix) and it came from both elements (1, 0) and (1, 1), so draw an arrow pointing upward, toward element (1, 1), and an arrow pointing diagonally, toward element (1, 0). Continuing with this process, we obtain the full matrix (with arrows) depicted in Figure 2.1. The final alignment score is in element (4, 3) and is -1.

Now the question is: how do we construct the alignment itself? This is where the arrows come in. Start with the final array element and follow the arrows back. An arrow from (i, j) pointing to element (i-1, j-1) (diagonally to the upper left) means that s(i) and t(j) are aligned with each other. An arrow pointing upward to (i-1, j) means that s(i) is aligned with a gap, and an arrow pointing to the left to (i, j-1) means that t(j) is aligned with a gap. Following the three possible arrow paths, we are able to build three possible alignments (all with the same score), where the total score is the sum of the individual scores at each position:

1) AAAC
   -AGC      total score = -2 + 1 - 1 + 1 = -1
2) AAAC
   A-GC      total score = +1 - 2 - 1 + 1 = -1
3) AAAC
   AG-C      total score = +1 - 1 - 2 + 1 = -1

Each of these alignments is equally good (two matches, one mismatch, and one character aligned with a gap). Note that we always recorded the score of the best path into each element. There are also paths through the matrix that correspond to very bad alignments.
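The matrix filling and arrow-following just described can be sketched in Python as follows. This is a minimal illustration under the simple scoring function (+1, -1, -2), not the code used in this thesis; the function name is hypothetical, and instead of storing arrows the sketch re-derives them during the trace-back, preferring the diagonal move when several arrows exist, so it returns only one of the optimal alignments.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal global alignment."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # 0th column: i letters against gaps
        sim[i][0] = i * gap
    for j in range(1, n + 1):          # 0th row: j letters against gaps
        sim[0][j] = j * gap

    def pair(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, m + 1):          # fill row by row, left to right
        for j in range(1, n + 1):
            sim[i][j] = max(sim[i - 1][j - 1] + pair(i, j),
                            sim[i - 1][j] + gap,
                            sim[i][j - 1] + gap)

    # Trace back from (m, n) to (0, 0), rebuilding the alignment in reverse.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and sim[i][j] == sim[i - 1][j - 1] + pair(i, j):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and sim[i][j] == sim[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return sim[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_s, row_t = needleman_wunsch("AAAC", "AGC")
print(score)   # -1
print(row_s)   # AAAC
print(row_t)   # -AGC
```

With the diagonal-first tie-break the sketch reproduces the first of the three alignments listed above; collecting all optimal alignments would require following every arrow at each cell.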
One such bad alignment, corresponding to moving down through the first column and then right through the last row, is

  AAAC---
  ----AGC      total score = -2 - 2 - 2 - 2 - 2 - 2 - 2 = -14

2.1.1.1. Global sequence alignment algorithm (Needleman–Wunsch algorithm)

The difficulty in pairwise global sequence alignment arises with long sequences, because finding the optimal alignment among all possible alignments becomes much harder: the number of different alignments (with gaps) of two sequences of length n is \(\binom{2n}{n}\), a quantity which grows exponentially with n. This means that for two sequences of length 30 there are approximately \(10^{17}\) possible alignments between them, so the naïve approach of calculating the scores of all possible alignments has an exponentially high cost. This problem can be solved efficiently by using the Needleman–Wunsch algorithm [10]; there are other dynamic programming algorithms for pairwise global alignment, such as those of Huang and Chao [11] and NGILA [12]. The Needleman–Wunsch algorithm is guaranteed to find the optimal score for any given symbol-scoring function in feasible time. It finds the best alignment without enumerating all alignments, because it is based on dynamic programming (DP; explained in detail in section 2.1). DP methods such as the Needleman–Wunsch (NW) algorithm allow us to start the computation by aligning very short DNA sequences and to grow this alignment efficiently to the full length of the two sequences. When implemented well, this approach has a much lower computational cost than the naïve solution. There are three elements to DP algorithms in sequence alignment: a recursive relation, a tabular computation, and a trace-back procedure.

The Needleman–Wunsch algorithm can be summarized as follows:
1. First we consider any two strings A = a_i, i = 1, 2, …, n, and B = b_j, j = 1, 2, …, m.
2.
An alignment matrix of size (n+1) x (m+1) is constructed, where the score of aligning a_i with b_j is

\[
\mathrm{score}(a_i, b_j) =
\begin{cases}
+1, & a_i = b_j \\
-\mu, & a_i \neq b_j
\end{cases}
\]

Here −µ and −δ refer to the scores of a mismatch and an indel respectively. Indel means insertion or deletion; it refers to the alignment of a gap with a character, and the gap is represented by the symbol "−".
3. The scores of the elements in the first column and the first row of the alignment matrix are given by

\[
S_{i,0} = -i\delta, \qquad S_{0,j} = -j\delta
\]

where S_{i,j} denotes the matrix entry sim(i, j).
4. Starting from the top left, compute each entry of the alignment matrix using the formula:

\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i-1,j-1) + 1 \text{ or } -\mu, & \text{align } a_i \text{ with } b_j \text{ (+1 for a match, } -\mu \text{ for a mismatch)} \\
\mathrm{sim}(i-1,j) - \delta, & \text{align } a_i \text{ with a gap} \\
\mathrm{sim}(i,j-1) - \delta, & \text{align } b_j \text{ with a gap}
\end{cases}
\]

where sim(i, j) is the optimal alignment score, and a_i and b_j denote the ith and jth characters of A and B respectively.
5. Trace back element by element along the path that yielded the maximum score into each matrix element.

The Needleman–Wunsch (NW) algorithm takes O(mn) time and O(mn) space. There is also a space-saving version of the algorithm that takes O(m+n) space but still runs in O(mn) time [4].

2.1.2. Local sequence alignment

Proteins are built of modules that can be mixed and matched. Thus, the global similarity may be poor for some sequences, but we may still want to look for local regions of similarity that suggest shared structural or functional subunits.

Definition: a local alignment of strings s and t is an alignment of a substring of s and a substring of t. The best alignment of substrings of s and t is called the optimal local alignment.

A local alignment can be obtained by removing a prefix and a suffix from each of the two sequences and testing how well we can align the remaining internal substrings. As an example, we may want to find similar substrings within the sequences s = QUEVIVALASVEGAS and t = VIVADAVIS.
This could be accomplished by computing the best (global) alignment between all subsequences of s and all subsequences of t, each subsequence being defined by ignoring a prefix and a suffix of the original sequence. A possible (but not optimal) local alignment is

  V I V A L A S V
  | | | |   |   |
  V I V A D A - V

where a prefix and a suffix have been removed from the original sequences. For clarity, we show the subsequences together with their prefixes and suffixes in an alignment-like representation:

  Q U E V I V A L A S V E G A S
  R R R                 R R R R
  - - - V I V A D A - V - - I S

where "R" denotes the removed parts. Note that the local alignment itself is the one presented above that contains only the subsequences, without their prefixes and suffixes.

Local alignment appears to be harder than global alignment, since it contains many instances of global alignment within it: we are optimizing not only over all possible alignments, but also over all possible choices of the start and end of each substring. However, a very clever adaptation of the Needleman–Wunsch algorithm, called the Smith–Waterman (SW) algorithm, makes it possible to perform local alignment at the same cost as global alignment. The keys to local alignment are to use a slightly more complex scoring function and a different method for reading the desired alignment from the table.

2.1.2.1 Local sequence alignment algorithm (Smith–Waterman algorithm)

The local alignment algorithm (or Smith–Waterman algorithm) [13] involves a very simple modification of the global alignment algorithm. We again use an (m+1) x (n+1) matrix sim. This time, however, the (i, j)th entry of sim holds the optimal alignment score between a suffix of the ith prefix of s and a suffix of the jth prefix of t. The basic intuition is that a suffix of a prefix is a substring, so sim(i, j) holds the best alignment score between substrings that end at s(i) and t(j).
Suppose that we want the best alignment between a substring of s ending at s(i) and a substring of t ending at t(j). This means that we need to pick the best starting points for the substrings, since their endpoints are determined. Note that s(i) is not necessarily aligned with t(j). We will fill in the sim matrix; a small addition to the global alignment algorithm solves this problem. First, we zero out the first row and column of sim, because there are no initial gaps in the best local alignment. Then we use the rule:

\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i-1,j-1) \pm 1 \\
\mathrm{sim}(i-1,j) - 2 \\
\mathrm{sim}(i,j-1) - 2 \\
0
\end{cases}
\]

where the sign in the first case is chosen as before, according to whether s(i) matches t(j) or not. The zero in the fourth case is used whenever the best of the other three scores would be negative. From the above, local alignment differs from global alignment in the following ways:
1. In local alignment the trace-back starts at the largest score in the scoring matrix and stops on reaching a zero, while in global alignment it starts at the lower right cell and ends at the upper left cell.
2. In local alignment any entry of the scoring matrix that would be negative is set to zero, whereas in global alignment entries may be negative.

See Figure 2.2 for an example of the algorithm at work, in a local alignment of the strings s = AGCT and t = GCA.

Figure: 2.2. Local Alignment Example

To construct an optimal local alignment, we look for the maximum value in the matrix. Then we follow its arrows back until we reach a zero. In Figure 2.2 we choose the value 2 in entry sim(3, 2). This gives the local alignment

  s: GC
  t: GC

If we want further optimal alignments, we take another maximum value in the matrix, generally one that has not already been visited by a previous path. If we want the second-best local alignment, we take the second-highest value in the matrix that has not already been visited.
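The two changes relative to global alignment (the extra zero in the maximum, and a trace-back that starts at the largest entry and stops at a zero) can be sketched in Python as below. This is an illustrative sketch with a hypothetical function name, using the same simple scoring function as before; it returns only the first maximum found and one path back from it.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_s, aligned_t) for one optimal local alignment."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay zero
    best, best_pos = 0, (0, 0)

    def pair(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sim[i][j] = max(sim[i - 1][j - 1] + pair(i, j),
                            sim[i - 1][j] + gap,
                            sim[i][j - 1] + gap,
                            0)                     # never let a score go negative
            if sim[i][j] > best:
                best, best_pos = sim[i][j], (i, j)

    # Trace back from the largest entry until a zero is reached.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and sim[i][j] > 0:
        if sim[i][j] == sim[i - 1][j - 1] + pair(i, j):
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif sim[i][j] == sim[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("AGCT", "GCA"))   # (2, 'GC', 'GC')
```

For s = AGCT and t = GCA the largest entry is sim(3, 2) = 2, which matches the alignment of GC with GC read from Figure 2.2.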
Such second-best alignments are sometimes useful and are called near-optimal alignments. In our case, if we start from sim(1, 3), we obtain the local alignment

  s: A
  t: A

This algorithm uses O(mn) time and O(mn) space. Again, it is possible to improve it to use O(m+n) space [4].

2.2 Gene Tracer algorithm

In this section we present a proposed algorithm, called Gene Tracer [14], that is used to find relations between sequences in order to determine their homology (genes or characters B and C that evolved from the same ancestral gene or character A are said to be homologs [6]). This algorithm is used in applications such as determining the originality or functionality of sequences. In other words, Gene Tracer is used to trace gene modifications from ancestor sequences through to an offspring sequence: it tracks down gene modifications in the ancestor sequences and finds the related parts of each ancestor sequence in the offspring sequence. Gene Tracer can find precisely the location of each ancestor's contribution inside the offspring sequence, and it gives statistical results that express the relationship between the two ancestor sequences and their offspring, as shown in figure 2.4.

Figure: 2.3. Gene Tracer Function

Gene Tracer's inputs are two ancestor sequences and one offspring sequence. Its output, as shown in Figure 2.3, is the three sequences with colored parts that represent the matching parts. The longest matching parts common to ancestor 1 and the offspring sequence are colored red; if their length is L and the total length of ancestor 1 is Z1, then the matching percentage is L / Z1. The same is done for ancestor 2 and the offspring, but in blue, and the percentage is K / Z2.

The Gene Tracer algorithm is a modification of the Smith–Waterman (SW) local alignment algorithm. The modifications are as follows:
1- Determining the locations of the common substrings in both the ancestor and the offspring sequences.
2- Distinguishing these common substrings by giving them clear and different colors.
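To make the ratio L / Z1 concrete, the following Python sketch finds the longest substring shared by an ancestor and an offspring and reports its length relative to the ancestor. This is only an illustration of the matching percentage described above, using a standard longest-common-substring dynamic program; it is not the Gene Tracer implementation from [14], and the function names and toy sequences are hypothetical.

```python
def longest_common_substring(a, b):
    """Return (length, start_in_a, start_in_b) of the longest common substring."""
    best_len, end_a, end_b = 0, 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at (i-1, j-1).
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, end_a, end_b = cur[j], i, j
        prev = cur
    return best_len, end_a - best_len, end_b - best_len


def matching_fraction(ancestor, offspring):
    """Length of the longest shared part divided by the ancestor length (L / Z1)."""
    if not ancestor:
        raise ValueError("ancestor sequence must not be empty")
    length, _, _ = longest_common_substring(ancestor, offspring)
    return length / len(ancestor)


# Toy sequences, invented for this example only.
print(longest_common_substring("ACGTAC", "TTCGTAGG"))   # (4, 1, 2)
print(matching_fraction("ACGTAC", "TTCGTAGG"))
```

The start positions returned by the first function correspond to the "locations of the common substrings" mentioned in modification 1; coloring them is a presentation step outside this sketch.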
Gene Tracer has complexity O(max(M, N) * P) in both computing time and memory space, where M and N are the lengths of the two ancestor sequences and P is the length of the offspring sequence.

2.3 Multiple Sequence Alignment algorithms

In the previous sections we considered alignments between two sequences. Here, we consider the case where we wish to align three or more entire sequences (i.e., global multiple sequence alignment).

Definition: Given k strings S1, S2, …, Sk, a multiple sequence alignment (MSA) is obtained by inserting gaps into the strings to make them all the same length. For example, the following is an MSA of the 4 sequences MQPILLLV, MLRLL, MKILLL, and MPPVLILV:

  M Q P I L L L V
  M L R - L L - -
  M K - I L L L -
  M P P V L I L V

Note that no column may consist entirely of gaps. Multiple sequence alignments are used for many reasons, including:
1. To detect regions of variability or conservation in a family of proteins (which might be missed in a pairwise alignment).
2. To provide stronger evidence than pairwise similarity for structural and functional inferences.
3. To serve as the first step in other computational procedures, such as phylogenetic reconstruction, RNA secondary structure prediction, and building profiles (probabilistic models) for protein families or DNA signals; these profiles can be used to locate other similar sequences.

For pairwise alignments, we scored each column by looking at matches, mismatches, and gaps in the two sequences (in practice, protein sequences are scored using substitution matrices). Now, however, we have multiple characters in each column, and it is not obvious what the best way to score a column is. There are many possibilities. The sum-of-pairs (SP) score is a common scoring scheme. Here, each column of an alignment is scored by summing the scores of all pairs of symbols in that column. The score of the entire alignment is then the sum of all column scores. We will assume that a match = 1, a mismatch = -1, and a gap = -2.
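The sum-of-pairs scheme can be sketched in Python as follows, using the match, mismatch and gap values assumed above. The function names are hypothetical, and a pair of two gaps is scored as zero, which is the convention this section adopts for columns in which two sequences both have a gap.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0                      # gap against gap contributes nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum the scores of all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """Sum-of-pairs score of a whole alignment given as equal-length rows."""
    return sum(sp_column(col) for col in zip(*rows))


print(sp_column("I-IV"))   # -7
print(sp_column("MMMM"))   # 6
```

A column of k symbols contains k(k-1)/2 pairs, so the four-sequence columns above each sum six pair scores.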
For example, the sum-of-pairs score of the 4th column of the MSA given earlier is:

SP(I, -, I, V) = score(I, -) + score(I, I) + score(I, V) + score(-, I) + score(-, V) + score(I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7

Although there is never an entire column of gaps, if we look at any two sequences in the alignment there may be columns where both have gaps; in this case the value of score(-, -) is set equal to zero.

A very popular and efficient heuristic algorithm for multiple alignment is CLUSTAL, originally developed by Desmond Higgins and Paul Sharp at Trinity College, Dublin in 1988 and extended by Higgins, Julie Thompson, and Toby Gibson into the current version, CLUSTALW [15].

How does CLUSTALW work?

CLUSTALW is a general-purpose multiple sequence alignment program for DNA or proteins, and it works as follows:
1. The sequences are aligned pairwise using the Needleman–Wunsch algorithm (global alignment).
2. The alignment scores between all pairs of sequences are computed.
3. A guide tree that reflects the similarities between the sequences is built, using the pairwise alignment distances.
4. The sequences are aligned following the guide tree. For each node in the tree, the algorithm aligns the sequences or alignments that are associated with its two daughter nodes. The process is repeated, beginning from the leaves (sequences) and ending with the tree root.

CHAPTER THREE
USING SEQUENCE ALIGNMENT ALGORITHMS TO DEMONSTRATE THE GENETIC EVOLUTION OF HEAT SHOCK FACTOR1 (HSF1) IN DIFFERENT EUKARYOTIC ORGANISMS

3.1. Overview

Tracing a specific gene, or the whole genome, across species is very important for biologists. Sometimes the aim of gene tracing is to understand the genetic evolution that occurs in a certain protein sequence in specific organisms, and also the genetic evolution that occurs in a conserved domain of this protein sequence. This process is very difficult for biologists, but computer algorithms make the problem easier.
In this chapter we introduce a comprehensive study using the different alignment algorithms explained in detail in chapter two, through which we demonstrate the genetic evolution that occurs in heat shock factor protein 1 (HSF1) in some eukaryotic organisms: human, Danio rerio, Taurus, mouse, plant (Arabidopsis), and yeast. In addition, this study illustrates the molecular evolution of the conserved domain of HSF1 (HSF_DNA-bind) across the different eukaryotic organisms.

3.2. Introduction

The focus of much recent research on biological evolution has been the detection of similarities and dissimilarities among living species. As the study of nature continues, human knowledge of variation among species has grown gradually in all directions. This knowledge is used to give names to the species we know. Identifying, naming, and organizing species into groups is a science called taxonomy. Up till now, millions of species have been identified [16], demonstrating the vast current knowledge about species. Scientists go further and consider the cause and evolution of the observed similarities and dissimilarities between different organisms. Many biologists, including Charles Darwin, have suggested that all living organisms share a common ancestor. Meanwhile, the differences among species are partly caused by mutations accumulated over generations during the course of evolution. Thus, phylogenetics is the science that studies the evolutionary relatedness among species. Researchers have established links among seemingly different life forms, from bacteria to animals and plants [17].

Pairwise global alignment algorithms are intended for comparing two sequences that are similar over their entire length. A dynamic programming (DP) algorithm by Needleman and Wunsch [18] was proposed for pairwise global alignment. Such methods are very useful in the analysis of DNA and protein sequences.
There are other dynamic programming algorithms for pairwise global alignment, such as those of Huang and Chao [11] and NGILA [12]. There are also algorithms for pairwise local alignment, such as the algorithm introduced by Smith and Waterman [13]. Another algorithm, introduced by Thompson et al. [15], is used for multiple sequence alignment (MSA). The algorithm of Thompson et al. is based on progressive alignment, which works by constructing a succession of pairwise alignments. Initially, two sequences are chosen and aligned by standard pairwise alignment; this alignment is fixed. Then a third sequence is chosen and aligned to the first alignment, and this process is iterated until all sequences have been aligned. Progressive alignment is heuristic: it does not separate the process of scoring an alignment from the optimization algorithm, and it does not directly optimize any global scoring function of alignment correctness. The advantage of progressive alignment is that it is fast and efficient, and in many cases the resulting alignments are reasonable. General results are presented in the conclusion.

3.3. Heat shock factor protein1 (HSF1)

In biochemistry, heat shock (HS) is the effect of subjecting a cell to a temperature higher than the ideal body temperature of the organism from which the cell line was derived. Heat shock factor (HSF), in molecular biology, is the name given to transcription factors that regulate the expression of the heat shock proteins. HSF1 is a member of the heat shock transcription factor family and is considered the major regulator of heat shock protein transcription in eukaryotes. Protein-damaging stress leads to the activation of HSF1, which binds to upstream regulatory sequences in the promoters of heat shock genes, leading to enhanced heat shock gene expression.

Figure: 3.1. Structure of Heat Shock Factor Protein 1 DNA Binding Domain from Homo sapiens.

3.4.
Biological data used in our work

All the biological data used were obtained from the National Center for Biotechnology Information (NCBI) and are written in FASTA format. FASTA is one of the formats used for writing sequences, and it is a general format accepted by all databases. In our work we use the following data:

1. Data representing the entire sequence of heat shock factor protein1 of [Homo sapiens] in FASTA format

>gi|5031767|ref|NP_005517.1| heat shock factor protein 1 [Homo sapiens]
MDLPVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSF
HVFDQGQFAKEVLPKYFKHNNMASFVRQLNMYGFRKVVHIEQ
GGLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLKSEDIKI
RQDSVTKLLTDVQLMKGKQECMDSKLLAMKHENEALWREVAS
LRQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIPLMLNDSGSA
HSMPKYSRQFSLEHVHGSGPYSAPSPAYSSSSLYAPDAVASSGPII
SDITELAPASPMASPGGSIDERPLSSSPLVRVKEEPPSPPQSPRVEE
ASPGRPSSVDTLLSPTALIDSILRESEPAPASVTALTDARGHTDTE
GRPPSPPPTSTPEKCLSVACLDKNELSDHLDAMDSNLDNLQTML
SSHGFSVDTSALLDLFSPSVTVPDMSLPDLDSSLASIQELLSPQEPP
RPPEAENSSPDSGKQLVHYTAQPLFLLDPGSVDTGSNDLPVLFEL
GEGSYFSEGDGFAEDPTISLLTGSEPPKAKDPTVS

2. Data representing the conserved domain (HSF_DNA-bind) of the heat shock factor protein1 sequence of [Homo sapiens] in FASTA format

NVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVFDQGQFAKEVL
PKYFKHNNMASFVRQLNMYGFRKVVHIEQGGLVKPERDDTEFQ
HPCFLRGQEQLLENIKRK

3.
Data representing the entire sequence of heat shock factor protein1 of [Danio rerio] in FASTA format

>gi|18858865|ref|NP_571675.1| heat shock factor protein 1 [Danio rerio]
MEYHSVGPGGVVVTGNNVPAFLTKLWTLVEDPDTDPLICWSPN
GTSFHVFDQGRFSKEVLPKYFKHNNMASFVRQLNMYGFRKVVH
IEQGGLVKPEKDDTEFQHPYFIRGQEQLLENIKRKVTTVSNIKHE
DYKFSTDDVSKMISDVQHMKGKQESMDSKISTLKHENEMLWRE
VATLRQKHSQQQKVVNKLIQFLITLARSNRVLGVKRKMPLMLN
DSSSAHSMPKFSRQYSLESPAPSSTAFTGTGVFSSESPVKTGPIISD
ITELAQSSPVATDEWIEDRTSPLVHIKEEPSSPAHSPEVEEVCPVE
VEVGAGSDLPVDTPLSPTTFINSILQESEPVFRPDSAPSEQKCLSV
ACLDNYPQMSEITRLFSGFSTSSLHLRPHSGTELHDHLESIDSGLE
NLQQILNAQSINFDSSPLFDIFSSAASDVDLDSLASIQDLLSPDPVK
ETESGVDTDSGKQLVQYTSQPSFSPIPFSTDSSSTDLPMLLELQDD
SYFSSEPTEDPTIALLNFQPVPEDPSRTRIGDPCFKLKKESKR

4. Data representing the conserved domain (HSF_DNA-bind) of the heat shock factor protein1 sequence of [Danio rerio] in FASTA format

AFLTKLWTLVEDPDTDPLICWSPNGTSFHVFDQGRFSKEVLPKYF
KHNNMASFVRQLNMYGFRKVVHIEQGGLVKPEKDDTEFQHPYF
IRGQEQLLENIKRK

5. Data representing the entire sequence of heat shock factor protein1 of [Bos taurus] in FASTA format

>gi|116003843|ref|NP_001070277.1| heat shock factor protein 1 [Bos taurus]
MDLPVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSF
HVLDQGQFAKEVLPKYFKHSNMASFVRQLNMYGFRKVVHIEQG
GLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLRSEDIKIR
QDSVTKLLTDVQLMKGKQESMDSKLLAMKHENEALWREVASL
RQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIPLMLNDGGPA
HPMPKYGRQYSLEHIHGPGPYPAPSPAYSGSSLYSPDAVTSSGPII
SDITELAPGSPVASSGGSVDERPLSSSPLVRVKEEPPSPPQSPRAEG
ASPGRPSSMVETPLSPTTLIDSILRESEPTPVASTTPLVDTGGRPPS
PLPASAPEKCLSVACLDKTELSDHLDAMDSNLDNLQTMLTSHGF
SVDTSTLLDLFSPSVTVPDMSLPDLDSSLASIQELLSPQEPPRPLEA
EKSSPDSGKQLVHYTAQPLLLLDPGSVDVGSSDLPVLFELGEGSY
FSEGDDYSDDPTISLLTGSEPPKAKDPTVS

6. Data representing the conserved domain (HSF_DNA-bind) of the heat shock factor protein1 sequence of [Bos taurus] in FASTA format

NVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVLDQGQFAKEVL
PKYFKHSNMASFVRQLNMYGFRKVVHIEQGGLVKPERDDTEFQ
HPCFLRGQEQLLENIKRK

7.
Data representing the entire sequence of heat shock factor protein1 of [Mus musculus] in FASTA format

>gi|62740231|gb|AAH94064.1| Hsf1 protein [Mus musculus]
MDLAVGPGAAGPSNVPAFLTKLWTLVSDPDTDALICWSPSGNSF
HVFDQGQFAKEVLPKYFKHNNMASFVRQLNMYGFRKVVHIEQ
GGLVKPERDDTEFQHPCFLRGQEQLLENIKRKVTSVSTLKSEDIKI
RQDSVTRLLTDVQLMKGKQECMDSKLLAMKHENEALWREVAS
LRQKHAQQQKVVNKLIQFLISLVQSNRILGVKRKIPLMLSDSNSA
HSVPKYGRQYSLEHVHGPGPYSAPSPAYSSSSLYSSDAVTSSGPII
SDITELAPTSPLASPGRSIDERPLSSSTLVRVKQEPPSPPHSPRVLE
ASPGRPSSMDTPLSPTAFIDSILRESEPTPAASNTAPMDTTGAQAP
ALPTPSTPEKCLSVACLDNLARAPQMSGVARLFPCPSSFLHGRVQ
PGNELSDHLDAMDSNLDNLQTMLTSHGFSVDTSALLDIQELLSP
QEPPRPIEAENSNPDSAGALHGSASVPAGS

8. Data representing the conserved domain (HSF_DNA-bind) of the heat shock factor protein1 sequence of [Mus musculus] in FASTA format

NVPAFLTKLWTLVSDPDTDALICWSPSGNSFHVFDQGQFAKEVL
PKYFKHNNMASFVRQLNMYGFRKVVHIEQGGLVKPERDDTEFQ
HPCFLRGQEQLLENIKRK

9. Data representing the entire sequence of heat shock factor protein1 of yeast [Saccharomyces cerevisiae] in FASTA format

>gi|1322586|emb|CAA96777.1| HSF1 [Saccharomyces cerevisiae]
MNNAANTGTTNESNVSDAPRIEPLPSLNDDDIEKILQPNDIFTTDR
TDASTTSSTAIEDIINPSLDPQSAASPVPSSSFFHDSRKPSTSTHLVR
RGTPLGIYQTNLYGHNSRENTNPNSTLLSSKLLAHPPVPYGQNPD
LLQHAVYRAQPSSGTTNAQPRQTTRRYQSHKSRPAFVNKLWSM
LNDDSNTKLIQWAEDGKSFIVTNREEFVHQILPKYFKHSNFASFV
RQLNMYGWHKVQDVKSGSIQSSSDDKWQFENENFIRGREDLLE
KIIRQKGSSNNHNSPSGNGNPANGSNIPLDNAAGSNNSNNNISSS
NSFFNNGHLLQGKTLRLMNEANLGDKNDVTAILGELEQIKYNQI
AISKDLLRINKDNELLWQENMMARERHRTQQQALEKMFRFLTSI
VPHLDPKMIMDGLGDPKVNNEKLNSANNIGLNRDNTGTIDELKS
NDSFINDDRNSFTNATTNARNNMSPNNDDNSIDTASTNTTNRKK
NIDENIKNNNDIINDIIFNTNLANNLSNYNSNNNAGSPIRPYKQRY
LLKNRANSSTSSENPSLTPFDIESNNDRKISEIPFDDEEEEETDFRP
FTSRDPNNQTSENTFDPNRFTMLSDDDLKKDSHTNDNKHNESDL
FWDNVHRNIDEQDARLQNLENMVHILSPGYPNKSFNNKTSSTNT
NSNMESAVNVNSPGFNLQDYLTGESNSPNSVHSVPSNGSGSTPLP
MPNDNDTEHASTSVNQGENGSGLTPFLTVDDHTLNDNNTSEGST
RVSPDIKFSATENTKVSDNLPSFNDHSYSTQADTAPENAKKRFVE
EIPEPAIVEIQDPTEYNDHRLPKRAKK

10.
Data representing the conserved domain (HSF_DNA-bind) of the heat shock factor protein1 sequence of yeast [Saccharomyces cerevisiae] in FASTA format

SRPAFVNKLWSMLNDDSNTKLIQWAEDGKSFIVTNREEFVHQIL
PKYFKHSNFASFVRQLNMYGWHKVQDVKSGSIQSSSDDKWQFE
NENFIRGREDLLEKIIRQ

11. Data representing the entire sequence of heat shock factor protein1 of plant [Arabidopsis thaliana] in FASTA format

>gi|7268528|emb|CAB78778.1| heat shock transcription factor HSF1 [Arabidopsis thaliana]
MFVNFKYFSFFIRTKMDGVTGGGTNIGEAVTAPPPRNPHPATLLN
ANSLPPPFLSKTYDMVEDPATDAIVSWSPTNNSFIVWDPPEFSRD
LLPKYFKHNNFSSFVRQLNTYGFRKVDPDRWEFANEGFLRGQK
HLLKKISRRKSVQGHGSSSSNPQSQQLSQGQGSMAALSSCVEVG
KFGLEEEVEQLKRDKNVLMQELVKLRQQQQTTDNKLQVLVKH
LQVMEQRQQQIMSFLAKAVQNPTFLSQFIQKQTDSNMHVTEAN
KKRRLREDSTAATESNSHSHSLEASDGQIVKYQPLRNDSMMWN
MMKTDDKYPFLDGFSSPNQVSGVTLQEVLPITSGQSQAYASVPS
GQPLSYLPSTSTSLPDTIMPETSQIPQLTRESINDFPTENFMDTEKN
VPEAFISPSPFLDGGSVPIQLEGIPEDPEIDELMSNFEFLEEYMPESP
VFGDATTLENNNNNNNNNNNNNNNNNNNNTNGRHMDKLIEEL
GLLTSETEH

12. Data representing the conserved domain (HSF_DNA-bind) of the heat shock factor protein1 sequence of plant [Arabidopsis thaliana] in FASTA format

PFLSKTYDMVEDPATDAIVSWSPTNNSFIVWDPPEFSRDLLPKYF
KHNNFSSFVRQLNTYGFRKVDPDRWEFANEGFLRGQKHLLKKIS
RRKS

3.5. Results and Discussion

3.5.1. Pairwise global alignment among the common conserved domains (HSF_DNA-bind) and also among the entire sequence of HSF1 protein

First, we wrote MATLAB code to perform pairwise global alignment among the common conserved domains (HSF_DNA-bind), which are part of the heat shock factor protein1 (HSF1) sequence, in all the eukaryotic organisms used, and we also performed pairwise global alignment among the entire sequences of the HSF1 protein.
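The results that follow begin with sequence dot plots. As a rough illustration of what such a plot contains, the Python sketch below marks every pair of positions at which two sequences carry the same residue. It is a simplified stand-in, not the MATLAB code used in this work: it assumes a window of one residue and exact identity, the function names are hypothetical, and the two short fragments are taken from the start of the human and Danio rerio conserved domains listed above.

```python
def dot_plot(seq_x, seq_y):
    """Return a matrix with 1 where seq_y[row] equals seq_x[col], else 0."""
    return [[1 if a == b else 0 for a in seq_x] for b in seq_y]


def render(matrix):
    """Render the matrix as text, using '*' for a dot and '.' for no dot."""
    return "\n".join("".join("*" if v else "." for v in row) for row in matrix)


# Similar sequences produce a long run of dots along a diagonal.
print(render(dot_plot("NVPAFLTKLW", "AFLTKLWTLV")))
```

Closely related sequences give a clear diagonal line, while distant ones give scattered dots, which is how the panels in the following figures are read.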
The results were as follows:

1) The results obtained by using the sequence dot plot matrix in the case of the conserved domains (HSF_DNA-bind) in HSF1 protein sequences

[Dot plot panels, one per pair of organisms, with residue positions on both axes: plant & yeast; yeast & taurus; taurus & Danio rerio; taurus & plant; yeast & mouse; taurus & human; mouse & plant; taurus & mouse; human & mouse; mouse & Danio rerio; yeast & human; plant & human; yeast & Danio rerio; Danio rerio & human; Danio rerio & plant]

Figure: 3.2.
Results of sequence dot plot matrix among the conserved domains (HSF_DNA-bind) in HSF1 protein sequences
2) The results obtained by using the Needleman-Wunsch algorithm in case of the conserved domains (HSF_DNA-bind) in HSF1 protein sequences
In our work we use the following scoring matrices:
1- Blosum50
2- Blosum30

Conserved domains (HSF_DNA-bind)     Score using Blosum50   Identities   Score using Blosum30   Identities
Human & Danio rerio                  210.667                89%          135.2                  89%
Human & Yeast                        114.333                46%          74                     46%
Human & Plant (Arabidopsis)          91.6667                51%          60.8                   51%
Human & Mouse                        244.667                100%         156.8                  100%
Human & Taurus                       240.333                98%          153.6                  98%
Danio rerio & Mouse                  210.667                89%          135.2                  89%
Danio rerio & Taurus                 206.333                87%          132                    87%
Danio rerio & Plant (Arabidopsis)    92                     51%          61.2                   51%
Danio rerio & Yeast                  103.667                46%          66.6                   46%
Mouse & Taurus                       240.333                98%          153.6                  98%
Mouse & Yeast                        114.333                46%          74                     46%
Mouse & Plant (Arabidopsis)          91.667                 51%          60.8                   51%
Taurus & Yeast                       116                    47%          75.2                   47%
Taurus & Plant (Arabidopsis)         88.6667                50%          58.6                   50%
Yeast & Plant (Arabidopsis)          71.6667                43%          51                     43%

Table 3.1. Scores of the pairwise global alignment and percentage of similarity (identities) among the conserved domains (HSF_DNA-bind) in the HSF1 protein sequences of all the previous organisms.
The results in Table 3.1 show that the highest similarity is between human and mouse (100%), while the lowest similarity is between yeast and plant (Arabidopsis) (43%). These results agree with those shown in Figure 3.2.
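To make the link between the dot plots and the identity figures concrete, here is a small Python sketch of how a dot plot matrix can be computed. It is an illustration under stated assumptions rather than the thesis code: the function names `dot_plot` and `render`, and the `window` and `min_matches` parameters, are hypothetical, and the input is a pair of 23-residue fragments taken from the yeast and Arabidopsis conserved domains in the data section. A cell is marked when at least `min_matches` of the `window` residue pairs starting at that cell, read along the diagonal, are identical; with the default `window=1` every identical residue pair is marked.

```python
def dot_plot(a, b, window=1, min_matches=1):
    """Return a list of rows of booleans; True marks a dot at (i, j)."""
    rows = len(a) - window + 1
    cols = len(b) - window + 1
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            # Count identical residue pairs along the diagonal from (i, j)
            hits = sum(a[i + k] == b[j + k] for k in range(window))
            row.append(hits >= min_matches)
        plot.append(row)
    return plot


def render(plot):
    """Draw the matrix as text, '*' for a dot and '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in plot
    )


# Fragments of the conserved domains (HSF_DNA-bind) from the data section
YEAST_FRAGMENT = "ILPKYFKHSNFASFVRQLNMYGW"
PLANT_FRAGMENT = "LLPKYFKHNNFSSFVRQLNTYGF"

if __name__ == "__main__":
    plot = dot_plot(YEAST_FRAGMENT, PLANT_FRAGMENT)
    print(render(plot))
    diagonal = sum(plot[i][i] for i in range(len(plot)))
    print("dots on the main diagonal:", diagonal)
```

Raising `window` and `min_matches` suppresses isolated chance matches, so that only runs of similar residues remain visible as diagonal segments.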
3) The results obtained by using the sequence dot plot matrix in case of the entire sequences of HSF1 protein
[Dot plot panels, with residue positions on both axes (up to about 500, and up to about 800 for yeast): Taurus & yeast; yeast & plant; Taurus & human; yeast & mouse; human & mouse; plant & mouse; human & plant; Taurus & mouse; yeast & Danio rerio; Taurus & Danio rerio; Taurus & human; yeast & human; Danio rerio & mouse; Danio rerio & plant; human & Danio rerio]
Figure: 3.3. Results of sequence dot plot matrix among the entire sequences of HSF1 protein.
4) The results obtained by using the Needleman-Wunsch algorithm in case of the entire sequence of HSF1 protein

HSF1 protein sequence                Score using Blosum50   Identities   Score using Blosum30   Identities
Human & Danio rerio                  544.333                56%          345.8                  55%
Human & Yeast                        440.333                22%          239.4                  21%
Human & Plant (Arabidopsis)          -8                     27%          23.8                   24%
Human & Mouse                        649.667                70%          406.6                  70%
Human & Taurus                       1017.33                89%          624.6                  89%
Danio rerio & Mouse                  415.333                52%          269.2                  51%
Danio rerio & Taurus                 565.333                56%          355.8                  56%
Danio rerio & Plant (Arabidopsis)    -7.6666                26%          23.8                   25%
Danio rerio & Yeast                  -397.667               22%          -212.4                 20%
Mouse & Taurus                       670                    70%          418.6                  70%
Mouse & Yeast                        -547                   22%          -306.6                 23%
Mouse & Plant (Arabidopsis)          34.3333                27%          52.8                   25%
Taurus & Yeast                       -427                   23%          -233.2                 23%
Taurus & Plant (Arabidopsis)         -5                     27%          24.6                   24%
Yeast & Plant (Arabidopsis)          -450.333               24%          -237                   23%

Table 3.2. Scores of the pairwise global alignment and percentage of similarity (identities) among the entire HSF1 protein sequences of all the previous species.
The results in Table 3.2 show that the highest similarity is between human and Taurus (89%), while the lowest similarity (22%) is between human and yeast, Danio rerio and yeast, and mouse and yeast. These results agree with those shown in Figure 3.3.
3.5.2. Multiple sequence alignment (MSA) among all the conserved domains (HSF_DNA-bind) in HSF1 protein sequences and also among all the entire sequences of HSF1 protein.
Second, we wrote MATLAB code to create a multiple sequence alignment (MSA) among all the conserved domains (HSF_DNA-bind) in the HSF1 protein sequences of all the previous organisms. We also performed MSA among the entire HSF1 protein sequences of those organisms.
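As a hedged illustration of how pairwise similarities can be summarised as a tree of the kind reported in this section, the Python sketch below applies average-linkage (UPGMA) clustering to the BLOSUM50 identities of Table 3.1, using 1 - identity as a crude distance. This is an assumption made for the example: the thesis builds its trees in MATLAB from multiple sequence alignments and does not state the clustering method, and the names `upgma`, `IDENTITY` and `DIST` are hypothetical.

```python
from itertools import combinations

# Percentage identities of the conserved domains (Table 3.1, Blosum50 column)
IDENTITY = {
    ("human", "danio rerio"): 89, ("human", "yeast"): 46,
    ("human", "plant"): 51, ("human", "mouse"): 100,
    ("human", "taurus"): 98, ("danio rerio", "mouse"): 89,
    ("danio rerio", "taurus"): 87, ("danio rerio", "plant"): 51,
    ("danio rerio", "yeast"): 46, ("mouse", "taurus"): 98,
    ("mouse", "yeast"): 46, ("mouse", "plant"): 51,
    ("taurus", "yeast"): 47, ("taurus", "plant"): 50,
    ("yeast", "plant"): 43,
}
NAMES = ["human", "mouse", "taurus", "danio rerio", "yeast", "plant"]

# Assumed distance measure: 1 - fractional identity
DIST = {frozenset(pair): 1 - pct / 100 for pair, pct in IDENTITY.items()}


def upgma(names, dist):
    """Average-linkage clustering; returns merges as (cluster, cluster, distance)."""
    clusters = [(name,) for name in names]

    def average(c1, c2):
        # Mean distance over all cross-cluster pairs of members
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((c1, c2, average(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


if __name__ == "__main__":
    for left, right, d in upgma(NAMES, DIST):
        print(" + ".join(left), "|", " + ".join(right), "at", round(d, 4))
```

With these numbers the first joins are human with mouse, then Taurus, then Danio rerio. Because only pairwise identities are used, the placement of yeast and plant need not match trees obtained from a full MSA.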
Multiple sequence alignment (MSA) is used to confirm and refine the results obtained in Table 3.1 and Table 3.2. The results are illustrated by using phylogenetic trees.
In our work we use the following scoring matrices:
1- Blosum60
2- Blosum80
3- PAM10
The results were as follows:
1) Comparative phylogenetic study of the conserved domains (HSF_DNA-bind) in HSF1 protein sequences:
Figure: 3.4. Graphical summary of the conserved domains of HSF1 protein in the different organisms human, Danio rerio, Taurus, mouse, yeast, and plant.
a) Matrix blosum60
A dendrogram is constructed from the HSF1 domain (HSF_DNA-bind) of human, mouse, Taurus, Danio rerio, plant (Arabidopsis), and yeast using the scoring matrix blosum60. Based on this matrix, human and mouse are the closest in the sequence structure of the HSF1 domain (HSF_DNA-bind). The tree also shows that the plant (Arabidopsis) and yeast HSF1 domains (HSF_DNA-bind) are closer to each other than to those of the other organisms.
[Dendrogram: MSA among the conserved domains using the scoring matrix blosum60. Leaves: human, mouse, taurus, danio rerio, yeast, arabidopsis; internal nodes: Root, Branch 1 to Branch 4; horizontal axis from 0 to 0.6.]
Figure: 3.5. This figure shows the similarity between the conserved domains in the HSF1 protein sequences of all the previous species in case of MSA using the scoring matrix blosum60.
b) Matrix blosum80
When we use the scoring matrix blosum80, the same results are obtained as with blosum60.
[Dendrogram: MSA among the conserved domains using the scoring matrix blosum80. Leaves: human, mouse, taurus, danio rerio, yeast, arabidopsis; internal nodes: Root, Branch 1 to Branch 4; horizontal axis from 0 to 0.6.]
Figure: 3.6. This figure shows the similarity between the conserved domains in the HSF1 protein sequences of all the previous species in case of MSA using the scoring matrix blosum80.
c) Matrix PAM10
When we use the scoring matrix PAM10, the same results are obtained as with blosum60 and blosum80.
[Dendrogram: MSA among the conserved domains using the scoring matrix PAM10. Leaves: human, mouse, taurus, danio rerio, yeast, arabidopsis; internal nodes: Root, Branch 1 to Branch 4; horizontal axis from 0 to 0.6.]
Figure: 3.7. This figure shows the similarity between the conserved domains in the HSF1 protein sequences of all the previous species in case of MSA using the scoring matrix PAM10.
2) Comparative phylogenetic study in case of the entire length of HSF1 protein sequences, as shown in the following phylogenetic trees:
a) Matrix blosum60
A dendrogram is constructed from the entire length of the HSF1 protein sequences of human, mouse, Taurus, Danio rerio, plant (Arabidopsis), and yeast using the scoring matrix blosum60. Based on this matrix, human and Taurus are the closest in the structure of the entire length of the HSF1 protein sequences. The tree also shows that plant (Arabidopsis) and yeast are much closer to each other than to the other organisms.
[Dendrogram: MSA among the entire sequences of HSF1 using the scoring matrix blosum60. Leaves: human, taurus, mouse, danio rerio, yeast, arabidopsis; internal nodes: Root, Branch 1 to Branch 4; horizontal axis from 0 to 1.8.]
Figure: 3.8. This figure shows the similarity between the entire HSF1 protein sequences of all the previous species using the scoring matrix blosum60.
b) Matrix blosum80
When we use the scoring matrix blosum80, the same results are obtained as with blosum60.
[Dendrogram: MSA among the entire sequences of HSF1 using the scoring matrix blosum80. Leaves: human, taurus, mouse, danio rerio, yeast, arabidopsis; internal nodes: Root, Branch 1 to Branch 4; horizontal axis from 0 to 1.8.]
Figure: 3.9.
This figure shows the similarity between the entire HSF1 protein sequences of all the previous species using the scoring matrix blosum80.
c) Matrix PAM10
When we use the scoring matrix PAM10, the same results are obtained as with blosum60 and blosum80.
[Dendrogram: MSA among the entire sequences of HSF1 using the scoring matrix PAM10. Leaves: arabidopsis, yeast, human, taurus, mouse, danio rerio; internal nodes: Root, Branch 1 to Branch 4; horizontal axis from -0.2 to 1.8.]
Figure: 3.10. This figure shows the similarity between the entire HSF1 protein sequences of all the previous species using the scoring matrix PAM10.
3.5.3. Using Gene Tracer algorithm to specify some results obtained in table3.1 and table3.2
In this section we use the Gene Tracer algorithm [14], which is described in detail in chapter two. As described there, the Gene Tracer algorithm is given two ancestor sequences and their offspring sequence; it tracks down gene modifications in the ancestor sequences and finds the related parts of each ancestor in the offspring. As shown in Table 3.1 and Table 3.2, human and Taurus are closer to each other than to the other organisms over the entire length of the HSF1 sequence, but the HSF1 conserved domain (HSF_DNA-bind) sequence is more similar between human and mouse than between the other pairs. We therefore use the Gene Tracer algorithm to identify the related parts of each ancestor sequence in the offspring sequence. Moreover, Gene Tracer is used to locate precisely the contribution of the ancestor sequences inside the offspring sequence, and it gives statistical results that express the relationship between the two ancestor sequences and their offspring. We consider Taurus as ancestor1, mouse as ancestor2, and human as the offspring. We coded the Gene Tracer algorithm in Perl and applied it to both the conserved domain sequences (HSF_DNA-bind) and the entire sequence of HSF1 protein.
The results were as in the following table:

Pair              Conserved domain (HSF_DNA-bind), match percentage   Entire sequence of HSF1 protein, match percentage
Human & Mouse     100%                                                55.26%
Human & Taurus    96.19%                                              100%

Table 3.3. Match percentage of the pairwise local alignment in the case of the entire sequence of HSF1 protein and of the conserved domain (HSF_DNA-bind) in the HSF1 protein sequence.
The results in Table 3.3 show that, for the conserved domain sequence (HSF_DNA-bind), the highest similarity is between human and mouse (100%), whereas for the entire sequence of HSF1 the highest similarity is between human and Taurus. These results confirm the results obtained in Table 3.1 and Table 3.2.
The following figures show the Gene Tracer results graphically and in detail; from these figures we can also identify the related parts of each ancestor sequence in the offspring sequence.
Figure: 3.11. Results of Gene Tracer program in the case of the conserved domain sequence (HSF_DNA-bind)
Figure: 3.12. Results of Gene Tracer program in the case of the entire sequence of HSF1 protein
3.6. Conclusion
Heat shock factor (HSF) is a transcriptional activator of heat shock genes. Heat shock transcription factors (Hsfs) bind to conserved regulatory elements located in the promoters of HSP genes, known as heat shock elements (HSEs) [19]. Upon activation, the HSFs bind to HSEs and interact with proteins of the basal transcription machinery [20][21]. The presence of common HSF recognition elements explains why the same HSFs can be activated in different related organisms [22].
The pairwise alignment comparison of HSF1 among the different eukaryotic organisms studied (human, Taurus, Danio rerio, mouse, plant (Arabidopsis), and yeast) shows that human and Taurus are the closest over the entire length of HSF1 using the scoring matrices BLOSUM30 and BLOSUM50, as shown in Table 3.2. However, the HSF1 conserved domain (HSF_DNA-bind) sequence is more similar between human and mouse than between the other pairs under the same BLOSUM matrices, as shown in Table 3.1. Similar results are obtained using multiple sequence alignment. As shown in Table 3.3, the results obtained by using the Gene Tracer algorithm confirm that the conserved domain (HSF_DNA-bind) in mouse is the same as in human and that the entire sequence of HSF1 protein in human is the same as in Taurus; the related parts between the sequences can also be shown clearly.
One important result follows from comparing Table 3.1 with Table 3.2: for all the eukaryotic organisms used, the degree of similarity for the conserved domain (HSF_DNA-bind) is higher than the degree of similarity for the entire sequence of HSF1 protein.
These results were accepted for publication under the title "Using Sequence Alignment Algorithms to Demonstrate The Genetic Evolution of Heat Shock Factor1 (Hsf1) In Different Eukaryotic Organisms" in the international journal Engineering Science and Technology: An International Journal (ESTIJ) [23].
CHAPTER FOUR
MATHEMATICAL MODELING AND CLASSIFICATION OF VIRUSES FROM HERPESVIRUS FAMILY
4.1. Overview
Modeling and classifying the viruses that belong to a specific family is important for biologists and for many biological applications. There are many ways to classify virus families. The degree of similarity or diversity among the structures of viral capsid proteins is very useful in studying the classification of virus families and their genetic evolution.
It is also important to propose a mathematical model of the virus life cycle in order to fully understand the life-cycle activities of virus families. In this chapter we introduce a proposed mathematical model for the simple life cycle of some viruses of the herpesvirus family, together with a comprehensive study of their classification using sequence alignment algorithms, in order to demonstrate their genetic evolution according to the structure of their capsid protein. The herpesvirus family is considered one of the most important families of enveloped DNA viruses, as it contains many viruses that are dangerous to human health. This family contains one of the newly discovered viruses, Epstein-Barr virus (EBV), also called human herpesvirus 4 (HHV-4), which is one of the most dangerous viruses. Infection with EBV occurs by the oral transfer of saliva [24] and genital secretions.
4.2. Introduction to viruses
Viruses are small infectious agents, but they are considered the most dangerous infectious agents for eukaryotic organisms. They can replicate only inside the living cells of other organisms. Once viruses enter a host cell, they introduce their genetic material, DNA or RNA, into it. Viruses infect a wide range of host cells, from prokaryotic (e.g., bacteria) to eukaryotic (e.g., plants, animals, and humans). A virus has a simple structure consisting of only two components, as shown in Figure 4.1:
1. Genome: the genome of a virus is either ribonucleic acid (RNA) or deoxyribonucleic acid (DNA); it contains only one type.
2. Coat or capsid protein: the protein that encloses the genome.
Figure: 4.1. Virus Structure components
DNA viruses have the following structure [25]:
Figure: 4.2. General DNA Virus structure.
Our study is focused on enveloped DNA viruses.
4.3. Enveloped DNA Viruses
Enveloped DNA viruses are found in three families [26], as shown in Figure 4.2:
1. Herpesviridae (herpesvirus) family
2.
Poxviridae (poxvirus) family
3. Hepadnaviridae (hepadnavirus) family
In this study we focus on the first family of enveloped DNA viruses, the herpesvirus family (colored in yellow). We propose a mathematical model for its life cycle and also introduce a comprehensive study that classifies the members of the herpesvirus family according to the structure of their capsid protein by using different sequence alignment algorithms.
4.4. Herpesviridae (herpesvirus) family
Herpesviridae is a large family of DNA viruses that cause diseases in animals, including humans. One of the unique features of the viruses in this family is that they have a double-stranded, linear molecular structure with icosahedral symmetry [27][28][29]. The members of this family are best known as herpesviruses. The word herpesviruses is thought to be derived from the Greek word herpein, meaning "to creep", which refers to the latent and recurring infections typical of this family of viruses. In this regard, Herpesviridae are known to cause latent or lytic infections. Five species of Herpesviridae are considered a serious danger to human health: HSV-1 and HSV-2 (both of which can cause orolabial herpes and genital herpes), varicella zoster virus (which causes chicken-pox and shingles), Epstein-Barr virus (which causes mononucleosis), and cytomegalovirus, all of which are extremely widespread among humans. Many reports indicate that at least one of these viruses infects 90% of people, and a latent form of the virus remains in most people [30][31][32].
There are 8 herpesvirus types that infect humans: herpes simplex viruses 1 and 2, varicella-zoster virus, EBV (Epstein-Barr virus), human cytomegalovirus, human herpesvirus 6, human herpesvirus 7, and Kaposi's sarcoma-associated herpesvirus [33]. There are more than 130 known herpesviruses [34], and some have been isolated from many organisms such as mammals, birds, fish, reptiles, amphibians, and mollusks [33]. The herpesvirus family contains the following virus types:
1. Herpes simplex1 (HSV1)
2. Herpes simplex2 (HSV2)
3. Epstein-Barr virus (EBV) or (HHV4)
4. Human herpesvirus6 (HHV6)
5. Human herpesvirus7 (HHV7)
6. Human herpesvirus8 (HHV8)
7. Varicella-zoster virus (VZV) or (HHV3)
8. Human cytomegalovirus (HCMV) or (HHV5)
In this chapter we classify these viruses according to the structure of their capsid protein sequences, using MATLAB algorithms, in order to understand the genetic evolution and the phylogenetic relations of this family.
4.5. Mathematical modeling of herpesvirus family life cycle
In this study, we describe a mathematical model for the general life cycle of herpesviruses, in which the virus uses the host cell's organelles to replicate new viruses; this is known as the lytic cycle [35]. During this cycle the virus has to infect the cell, so it attaches itself to the outer cell wall and releases enzymes that weaken the cell wall, as shown in Figure 4.3.
Figure: 4.3. Simple diagram representing the herpesvirus lytic cycle
The lytic cycle is considered the main cycle in viral replication. As shown in Figure 4.3, the viral DNA enters the cell, transcribes itself into the host cell's messenger RNAs, and uses them to direct the ribosomes to produce viral proteins, including the capsid protein. The virus takes over the cell's metabolic activities and the host cell's DNA is destroyed. The virus produces progeny phages using the cell's energy for its own propagation. The viruses multiply and the original viruses release enzymes to break the cell wall.
The cell wall bursts, in a process known as lysing, and the new viruses are released [36].
The lytic cycle of a virus consists of six stages [37]. The first two stages are adsorption and penetration. In the adsorption stage, the virus must attach to a receptor on the cell membrane to be able to enter the cell through the plasma membrane. In the penetration stage, the virus releases its genetic material into the cell. This is followed by the integration stage, in which host cell gene expression is arrested and viral material is embedded into the host cell nucleus. The fourth stage is biosynthesis, in which the virus uses the cell machinery to make large amounts of viral components and, in the meantime, destroys the host's DNA. In the last two stages, maturation and lysis, the mature virus particles are formed and released.
In this chapter, a basic mathematical model of the lytic cycle is proposed in four main stages, and the model is converted into a system of equations. The equations are solved numerically to describe how the lytic phage terminates its infection and breaches its host's cell envelope over time t. The reaction scheme is described by the following four reactions:

d/dt    decay      forward    forward    reverse
[x1]    -k1[x1]    0          0          k4[x4]
[x2]    k1[x1]     -k2[x2]    0          0
[x3]    0          k2[x2]     -k3[x3]    0
[x4]    0          0          k3[x3]     -k4[x4]

Table 4.1. This table shows the modeling steps symbols.
The dynamics of the system can be written as the following set of first-order ODEs:
d[x1]/dt = k4[x4] - k1[x1],
d[x2]/dt = k1[x1] - k2[x2],
d[x3]/dt = k2[x2] - k3[x3],
d[x4]/dt = k3[x3] - k4[x4].
This system of ODEs can be solved numerically by using MATLAB. The mathematical model is then solved for some initial values.
[Plot: solutions x1, x2, x3, and x4 (vertical axis "solution x", scale x 10^4, from 0 to 8) against time t (horizontal axis, from 0 to 70).]
Figure: 4.4. The solution to the system of equations produced by ODE45.
4.6.
Biological data used in our work
All biological data used (the sequences of the capsid protein of the herpesvirus family) were obtained from the protein sequence repository of the National Center for Biotechnology Information (NCBI) and from the UniProt Knowledgebase (UniProtKB), which is the central hub for the collection of functional information on proteins. All these data are written in FASTA format, which is one of the most popular formats for writing sequences in bioinformatics applications and is common to most biological databases.
1. Data represent the entire sequence of major capsid protein of Herpes simplex virus type 1
>sp|P06491|MCP_HHV11 Major capsid protein OS=Human herpesvirus 1 (strain 17) GN=UL19 PE=1 SV=1
MAAPNRDPPGYRYAAAMVPTGSLLSTIEVASHRRLFDFFSRVRS
DANSLYDVEFDALLGSYCNTLSLVRFLELGLSVACVCTKFPELA
YMNEGRVQFEVHQPLIARDGPHPIEQPTHNYMTKIIDRRALNAA
FSLATEAIALLTGEALDGTGISAHRQLRAIQQLARNVQAVLGAFE
RGTADQMLHVLLEKAPPLALLLPMQRYLDNGRLATRVARATLV
AELKRSFCETSFFLGKAGHRREAVEAWLVDLTTATQPSVAVPRL
THADTRGRPVDGVLVTTAPIKQRLLQSFLKVEDTEADVPVTYGE
MVLNGANLVTALVMGKAVRSLDDVGRHLLEMQEEQLDLNRQT
LDELESAPQTTRVRADLVSIGEKLVFLEALEKRIYAATNVPYPLV
GAMDLTFVLPLGLFNPVMERFAAHAGDLVPAPGHPDPRAFPPRQ
LFFWGKDRQVLRLSLEHAIGTVCHPSLMNVDAAVGGLNRDPVE
AANPYGAYVAAPAGPAADMQQLFLNAWGQRLAHGRVRWVAE
GQMTPEQFMQPDNANLALELHPAFDFFVGVADVELPGGDVPPA
GPGEIQATWRVVNGNLPLALCPAAFRDARGLELGVGRHAMAPA
TIAAVRGAFDDRNYPAVFYLLQAAIHGSEHVFCALARLVVQCIT
SYWNNTRCAAFVNDYSLVSYVVTYLGGDLPEECMAVYRDLVA
HVEALAQLVDDFTLTGPELGGQAQAELNHLMRDPALLPPLVWD
CDALMRRAALDRHRDCRVSAGGHDPVYAAACNVATADFNRND
GQLLHNTQARAADAADDRPHRGADWTVHHKIYYYVMVPAFSR
GRCCTAGVRFDRVYATLQNMVVPEIAPGEECPSDPVTDPAHPLH
PANLVANTVNAMFHNGRVVVDGPAMLTLQVLAHNMAERTTAL
LCSAAPDAGANTASTTNMRIFDGALHAGILLMAPQHLDHTIQNG
DYFYPLPVNALFAGADHVANAPNFPPALRDLSRQVPLVPPALGA
NYFSSIRQPVVQHVRESAAGENALTYALMAGYFKISPVALHHQL
KTGLHPGFGFTVVRQDRFVTENMLFSERASEAYFLGQLQVARHE
TGGGVNFTLTQPRGNVDLGVGYTAVVATATVRNPVTDMGNLP
QNFYLGRGAPPLLDNAAAVYLRNAVVAGNRLGPAQPVPVFGCA
QVPRRAGMDHGQDAVCEFIATPVSTDVNYFRRPCNPRGRAAGG
VYAGDKEGDVTALMYDHGQSDPSRAFAATANPWASQRFSYGD
LLYNGAYHLNGASPVLSPCFKFFTSADIAAKHRCLERLIVETGSA
VSTATAASDVQFKRPPGCRELVEDPCGLFQEAYPLTCASDPALL
RSARNGEAHARETHFAQYLVYDASPLKGLAL
2. Data represent the entire sequence of the Major capsid protein of Human herpes simplex virus type2
>gi|360039880|gb|AEV91357.1| major capsid protein [Human herpesvirus 2]
MAAPARDPPGYRYAAAILPTGSILSTIEVASHRRLFDFFAAVRSD
ENSLYDVEFDALLGSYCNTLSLVRFLELGLSVACVCTKFPELAY
MNEGRVQFEVHQPLIARDGPHPVEQPVHNYMTKVIDRRALNAA
FSLATEAIALLTGEALDGTGISLHRQLRAIQQLARNVQAVLGAFE
RGTADQMLHVLLEKAPPLALLLPMQRYLDNGRLATRVARATLV
AELKRSFCDTSFFLGKAGHRREAIEAWLVDLTTATQPSVAVPRL
THADTRGRPVDGVLVTTAAIKQRLLQSFLKVEDTEADVPVTYGE
MVLNGANLVTALVMGKAVRSLDDVGRHLLDMQEEQLEANRET
LDELESAPQTTRVRADLVAIGDRLVFLEALERRIYAATNVPYPLV
GAMDLTFVLPLGLFNPAMERFAAHAGDLVPAPGHPEPRAFPPRQ
LFFWGKDHQVLRLSMENAVGTVCHPSLMNIDAAVGGVNHDPV
EAANPYGAYVAAPAGPGADMQQRFLNAWRQRLAHGRVRWVA
ECQMTAEQFMQPDNANLALELHPAFDFFAGVADVELPGGEVPP
AGPGAIQATWRVVNGNLPLALCPVAFRDARGLELGVGRHAMAP
ATIAAVRGAFEDRSYPAVFYLLQAAIHGNEHVFCALARLVTQCIT
SYWNNTRCAAFVNDYSLVSYIVTYLGGDLPEECMAVYRDLVAH
VEALAQLVDDFTLPGPELGGQAQAELNHLMRDPALLPPLVWDC
DGLMRHAALDRHRDCRIDAGGHEPVYAAACNVATADFNRNDG
RLLHNTQARAADAADDRPHRPADWTVHHKIYYYVLVPAFSRGR
CCTAGVRFDRVYATLQNMVVPEIAPGEECPSDPVTDPAHPLHPA
NLVANTVKRMFHNGRVVVDGPAMLTLQVLAHNMAERTTALLC
SAAPDAGANTASTANMRIFDGALHAGVLLMAPQHLDHTIQNGE
YFYVLPVHALFAGADHVANAPNFPPALRDLARDVPLVPPALGA
NYFSSIRQPVVQHARESAAGENALTYALMAGYFKMSPVALYHQ
LKTGLHPGFGFTVVRQDRFVTENVLFSERASEAYFLGQLQVARH
ETGGGVNFTLTQPRGNVDLGVGYTAVAATGTVRNPVTDMGNLP
QNFYLGRGAPPLLDNAAAVYLRNAVVAGNRLGPAQPLPVFGCA
QVPRRAGMDHGQDAVCEFIATPVATDINYFRRPCNPRGRAAGG
VYAGDKEGDVIALMYDHGQSDPARPFAATANPWASQRFSYGDL
LYNGAYHLNGASPVLSPCFKFFTAADITAKHRCLERLIVETGSAV
STATAASDVQFKRPPGCRELVEDPCGLFQEAYPITCASDPALLRS
ARDGEAHARETHFTQYLIYDASPLKGLSL
3.
Data represent the entire sequence of the Major capsid protein of Epstein-Barr virus (EBV)
>tr|V5KU49|V5KU49_EBVG Major capsid protein OS=Epstein-Barr virus (strain GD1) GN=BcLF1 PE=4 SV=1
MASNEGVENRPFPYLTVDADLLSNLRQSAAEGLFHSFDLLVGKD
AREAGIKFEVLLGVYTNAIQYVRFLETALAVSCVNTEFKDLSRM
TDGKIQFRISVPTIAHGDGRRPSKQRTFIVVKNCHKHHISTEMELS
MLDLEILHSIPETPVEYAEYVGAVKTVASALQFGVDALERGLINT
VLSVKLRHAPPMFILQTLADPTFTERGFSKTVKSDLIAMFKRHLL
EHSFFLDRAENMGSGFSQYVRSRLSEMVAAVSGESVLKGVSTYT
TAKGGEPVGGVFIVTDNVLRQLLTFLGEEADNQIMGPSSYASFV
VRGENLVTAVSYGRVMRTFEHFMARIVDSPEKAGSTKSDLPAV
AAGVEDQPRVPISAAVIKLGNHAVAVESLQKMYNDTQSPYPLNR
RMQYSYYFPVGLFMPNPKYTTSAAIKMLDNPTQQLPVEAWIVN
KNNLLLAFNLQNALKVLCHPRLHTPAHTLNSLNAAPAPRDRRET
YSLQHRRPNHMNVLVIVDEFYDNKYAAPVTDIALKCGLPTEDFL
HPSNYDLLRLELHPLYDIYIGRDAGERARHRAVHRLMVGNLPTP
LAPAAFQEARGQQFETATSLAHVVDQAVIETVQDTAYDTAYPAF
FYVVEAMIHGFEEKFVMNVPLVSLCINTYWERSGRLAFVNSFSM
IKFICRHLGNNAISKEAYSMYRKIYGELIALEQALMRLAGSDVVG
DESVGQYVCALLDPNLLPPVAYTDIFTHLLTVSDRAPQIIIGNEVY
ADTLAAPQFIERVGNMDEMAAQFVALYGYRVNGDHDHDFRLH
LGPYVDEGHADVLEKIFYYVFLPTCTNAHMCGLGVDFQHVAQT
LAYNGPAFSHHFTRDEDILDNLENGTLRDLLEISDLRPTVGMIRD
LSASFMTCPTFTRAVRVSVDNDVTQQLAPNPADKRTEQTVLVN
GLVAFAFSERTRAVTQCLFHAIPFHMFYGDPRVAATMHQDVATF
VMRNPQQRAVEAFNRPEQLFAEYREWHRSPMGKYAAECLPSLV
SISGMTAMHIKMSPMAYIAQAKLKIHPGVAMTVVRTDEILSENIL
FSSRASTSMFIGTPNVSRREARVDAVTFEVHHEMASIDTGLSYSS
TMTPARVAAITTDMGIHTQDFFSVFPAEAFGNQQVNDYIKAKVG
AQRNGTLLRDPRTYLAGMTNVNGAPGLCHGQQATCEIIVTPVTA
DVAYFQKSNSPRGRAACVVSCENYNQEVAEGLIYDHSRPDAAY
EYRSTVNPWASQLGSLGDIMYNSSYRQTAVPGLYSPCRAFFNKE
ELLRNNRGLYNMVNEYSQRLGGHPATSNTEVQFVVIAGTDVFLE
QPCSFLQEAFPALSASSRALIDEFMSVKQTHAPIHYGHYIIEEVAP
VRRILKFGNKVVF
4.
Data represent the entire sequence of the Major capsid protein of Human herpesvirus 6 (HHV6)
>sp|P17887|MCP_HHV6U Major capsid protein OS=Human herpesvirus 6A (strain Uganda-1102) GN=U57 PE=3 SV=3
MENWQATEILPKIEAPLNIFNDIKTYTAEQLFDNLRIYFGDDPSRY
NISFEALLGIYCNKIEWINFFTTPIAVAANVIRFNDVSRMTLGKVL
FFIQLPRVATGNDVTASKETTIMVAKHSEKHPINISFDLSAACLEH
LENTFKNTVIDQILNINALHTVLRSLKNSADSLERGLIHAFMQTLL
RKSPPQFIVLTMNENKVHNKQALSRVQRSNMFQSLKNRLLTSLF
FLNRNNNISYIYRILNDMMESVTESILNDTNNYTSKENVPLDGVL
LGPIGSIQKLTSILSQYISTQVVSAPISYGHFIMGKENAVTAIAYRA
IMADFTQFTVNAGTEQQDTNNKSEIFDKSRAYADLKLNTLKLGD
KLVAFDHLHKVYKNTDVNDPLEQSLQLTFFFPLGIYIPSETGFST
METRVKLNDTMENNLPTSVFFHNKDQVVQRIDFADILPSVCHPI
VHDSTIVERLMKSEPLPTGHRFSQLCQLKITRENPARILQTLYNLY
ESRQEVPKNTNVLKNELNIEDFYKPDNPTLPTERHPFFDLTYIQK
NRATEVLCTPRIMIGNIPLPLAPVSFHEARTNQILEHAKTNCQKY
DFTLKIVTESLTSGSYPELAYVIETLVHGNKHAFMILKQVISQCIS
YWFNMKHILLFCNSFEMIMLISNHMGDELIPGAAFAHYRNLVSLI
RLVKRTISISNLNEQLCGEPLVNFANALFDGRLFCPFVHTMPRND
TNAKITADDTPLTQNTVRVRNYEISDVQRMNLIDSSVVFTDNDR
PSNETTILSKIFYFCVLPALSNNKACGAGVNVKELVLDLFYTEPFI
SPDDYFQENPITSDVLMSLIREGMGPGYTVANTSCIAKQLFKSLIY
INENTKILEVEVSLDPAQRHGNSVHFQSLQHILYNGLCLISPITTLR
RYYQPIPFHRFFSDPGICGTMNADIQVFLNTFPHCQRNDGGFPLPP
PLALEFYNWQRTPFSVYSAFCPNSLLSIMTLAAMHSKLSPVAIAI
QSKNKIHPGFAATLVRTDNFDVECLLYSSRAATSIILDDPTVTAE
AKDIATTYNFTQHLSFVDMGLGFSSTTATANLKRIKSDMGSKIQ
NLFSAFPIHAFTNADINTWIRHHVGIEKPNPSESEALNIITFGGINK
NPPSILLHGQQAICEVILTPVTTNINFFKSPHNPRGRESCMMGTDP
HNEEAARKALYDHTQTDSDTFAATTNPWASLPGSLGDILYNTAH
REQLCYNPKTYSPNAQFFTESDILKTNKMMYKVISEYCMKSNSC
LNSDSEIQYSCSEGTDSFVSRPCQFLQNALPLHCSSNQALLESRSK
TGNTQISETHYCNYAIGETIPFQLIIESSI
5.
Data represent the entire sequence of the Major capsid protein of Human herpesvirus 7 (HHV7)
>sp|P52347|MCP_HHV7J Major capsid protein OS=Human herpesvirus 7 (strain JI) GN=U57 PE=3 SV=1
MENWRTAEIFPKLDVSPNVFDDIRTQTAEQLFENLRLYYGDDSD
RYNISFEALLGIYCNRTEWIDFFHTSIAVAANVIRFNDLDKMSLG
KILFYIQLPRVATGNDVTAPKETTVLVTKYSEKHPINISFELSAAC
LAHLENTFKNTILDQMLNINAIHTVLRSLKNSADSLQRGLIYAFIK
TILKKAPPQFILKTMLENKVNSKQILSKVQRSNMFQNFKNKLINS
LFFLNRTSNVSFIYRYLCEMVDSTTESILNNTNSYVLKDGTPINGV
LLGTPNTIQILSNALSQHISQMTMSVPVSYGTFVMGKENAVTAIA
YQAIMADFSNYTKNVATETQDQNKKSEIFENQTQHADLKTNIIQ
LSDKTVVLDHLKKVYKNTNIEDPLEQKLELTFFFPMGLYISKDSG
FSTMDSRLKLNDTMENNLPTSIYFYNKDKLLQRIDYSDLLPSLCH
PIIFDCSVSERIFKNAAKPTGESFNQLCQVEFVREPPSTFLSNLYNL
YEMKKEIPKTTNMLKNELTTEDFYKSENFTLKTELHPFFDFTYIQ
KNRSTDVLCSPRILLGNIPLPLAPSSFHEARTNQMIEQAKTNNLN
YDYTLKLVVESLTNTAYPELAYIIELLIHGNKTAFQILKDVVSQCI
TYWYNIKHILLFCNNFEMIWLITTYLGDESIPGIAYTHYKNIISILK
LVKRTISISNFNEQLCGEPLVGFVNALFDNRLFPPFLNSLPKNEAN
AIITAGNTPLTQNTVKLRNYEVSDLNRMNLLDSTEIFTDVDRPSF
ETIVLSKIFYFCFLPALTNNKMCGAGFDVKSFILDFFYTEPFILPDD
NFCELPITNNVLIELITEAVGPSHALTDLSCIGKQLFKSILYLTENT
KILEIESSLDPSQRHGSSSNFKSLQHVLYNGLCLVSPINVLKRYFK
PIPFNRFFSDPIICGLMNIEVQTYLNIFPHYQRNDGGFPLPQALSHE
FHNWQRTPFFVYASCCSNSLLSIMTLATMHCKLSPIAIILQSRQKI
HPGFAATLVRTDCFDINCLLYSSKSATSIMIDDPTVSTEVKDISTT
YNLTQHISFLDMGLGFSSSTAIANLKRVKTDMGSKVQDLFSVFP
MHAYTNPTVNSWVRHHVGIEKPNPSETDALNILSFGKINKQSQSI
LLHGQQAICEVVITPVTSDINFYKTPKNPRGRASCMMGVDPHNE
SEARKSLYDHSRVDSDAFVATTNPWASQEGSLSDVLYNINHRDQ
LGYNPKSYSPNAVFFTDTEIFKTNKFMFKLISDYSIKTKTCLDSDT
DIQYSCSEGTDDVTHRPCQFLQIAFPIHCSSNQALLESRSKNGMT
QLSETHFANFAIGECIPLQNIIESLL
6.
Data represent the entire sequence of the Major capsid protein of Human herpesvirus 8 (HHV8)
>sp|Q2HRA7|MCP_HHV8P Major capsid protein OS=Human herpesvirus 8 type P (isolate GK18) GN=ORF25 PE=3 SV=1
MEATLEQRPFPYLATEANLLTQIKESAADGLFKSFQLLLGKDARE
GSVRFEALLGVYTNVVEFVKFLETALAAACVNTEFKDLRRMIDG
KIQFKISMPTIAHGDGRRPNKQRQYIVMKACNKHHIGAEIELAAA
DIELLFAEKETPLDFTEYAGAIKTITSALQFGMDALERGLVDTVL
AVKLRHAPPVFILKTLGDPVYSERGLKKAVKSDMVSMFKAHLIE
HSFFLDKAELMTRGKQYVLTMLSDMLAAVCEDTVFKGVSTYTT
ASGQQVAGVLETTDSVMRRLMNLLGQVESAMSGPAAYASYVV
RGANLVTAVSYGRAMRNFEQFMARIVDHPNALPSVEGDKAALA
DGHDEIQRTRIAASLVKIGDKFVAIESLQRMYNETQFPCPLNRRIQ
YTYFFPVGLHLPVPRYSTSVSVRGVESPAIQSTETWVVNKNNVPL
CFGYQNALKSICHPRMHNPTQSAQALNQAFPDPDGGHGYGLRY
EQTPNMNLFRTFHQYYMGKNVAFVPDVAQKALVTTEDLLHPTS
HRLLRLEVHPFFDFFVHPCPGARGSYRATHRTMVGNIPQPLAPRE
FQESRGAQFDAVTNMTHVIDQLTIDVIQETAFDPAYPLFCYVIEA
MIHGQEEKFVMNMPLIALVIQTYWVNSGKLAFVNSYHMVRFICT
HIGNGSIPKEAHGHYRKILGELIALEQALLKLAGHETVGRTPITHL
VSALLDPHLLPPFAYHDVFTDLMQKSSRQPIIKIGDQNYDNPQNR
ATFINLRGRMEDLVNNLVNIYQTRVNEDHDERHVLDVAPLDEN
DYNPVLEKLFYYVLMPVCSNGHMCGMGVDYQNVALTLTYNGP
VFADVVNAQDDILLHLENGTLKDILQAGDIRPTVDMIRVLCTSFL
TCPFVTQAARVITKRDPAQSFATHEYGKDVAQTVLVNGFGAFA
VADRSREAAETMFYPVPFNKLYADPLVAATLHPLLPNYVTRLPN
QRNAVVFNVPSNLMAEYEEWHKSPVAAYAASCQATPGAISAMV
SMHQKLSAPSFICQAKHRMHPGFAMTVVRTDEVLAEHILYCSRA
STSMFVGLPSVVRREVRSDAVTFEITHEIASLHTALGYSSVIAPAH
VAAITTDMGVHCQDLFMIFPGDAYQDRQLHDYIKMKAGVQTGS
PGNRMDHVGYTAGVPRCENLPGLSHGQLATCEIIPTPVTSDVAY
FQTPSNPRGRAASVVSCDAYSNESAERLFYDHSIPDPAYECRSTN
NPWASQRGSLGDVLYNITFRQTALPGMYSPCRQFFHKEDIMRYN
RGLYTLVNEYSARLAGAPATSTTDLQYVVVNGTDVFLDQPCHM
LQEAYPTLAASHRVMLAEYMSNKQTHAPVHMGQYLIEEVAPM
KRLLKLGNKVVY
7.
Data represent the entire sequence of the major capsid protein of Human cytomegalovirus (HCMV)
>sp|P16729|MCP_HCMVA Major capsid protein OS=Human cytomegalovirus (strain AD169) GN=UL86 PE=3 SV=1
MENWSALELLPKVGIPTDFLTHVKTSAGEEMFEALRIYYGDDPE
RYNIHFEAIFGTFCNRLEWVYFLTSGLAAAAHAIKFHDLNKLTTG
KMLFHVQVPRVASGAGLPTSRQTTIMVTKYSEKSPITIPFELSAA
CLTYLRETFEGTILDKILNVEAMHTVLRALKNTADAMERGLIHSF
LQTLLRKAPPYFVVQTLVENATLARQALNRIQRSNILQSFKAKM
LATLFLLNRTRDRDYVLKFLTRLAEAATDSILDNPTTYTTSSGAK
ISGVMVSTANVMQIIMSLLSSHITKETVSAPATYGNFVLSPENAV
TAISYHSILADFNSYKAHLTSGQPHLPNDSLSQAGAHSLTPLSMD
VIRLGEKTVIMENLRRVYKNTDTKDPLERNVDLTFFFPVGLYLPE
DRGYTTVESKVKLNDTVRNALPTTAYLLNRDRAVQKIDFVDAL
KTLCHPVLHEPAPCLQTFTERGPPSEPAMQRLLECRFQQEPMGG
AARRIPHFYRVRREVPRTVNEMKQDFVVTDFYKVGNITLYTELH
PFFDFTHCQENSETVALCTPRIVIGNLPDGLAPGPFHELRTWEIME
HMRLRPPPDYEETLRLFKTTVTSPNYPELCYLVDVLVHGNVDAF
LLIRTFVARCIVNMFHTRQLLVFAHSYALVTLIAEHLADGALPPQ
LLFHYRNLVAVLRLVTRISALPGLNNGQLAEEPLSAYVNALHDH
RLWPPFVTHLPRNMEGVQVVADRQPLNPANIEARHHGVSDVPR
LGAMDADEPLFVDDYRATDDEWTLQKVFYLCLMPAMTNNRAC
GLGLNLKTLLVDLFYRPAFLLMPAATAVSTSGTTSKESTSGVTPE
DSIAAQRQAVGEMLTELVEDVATDAHTPLLQACRELFLAVQFV
GEHVKVLEVRAPLDHAQRQGLPDFISRQHVLYNGCCVVTAPKT
LIEYSLPVPFHRFYSNPTICAALSDDIKRYVTEFPHYHRHDGGFPL
PTAFAHEYHNWLRSPFSRYSATCPNVLHSVMTLAAMLYKISPVS
LVLQTKAHIHPGFALTAVRTDTFEVDMLLYSGKSCTSVIINNPIVT
KEERDISTTYHVTQNINTVDMGLGYTSNTCVAYVNRVRTDMGV
RVQDLFRVFPMNVYRHDEVDRWIRHAAGVERPQLLDTETISML
TFGSMSERNAAATVHGQKAACELILTPVTMDVNYFKIPNNPRGR
ASCMLAVDPYDTEAATKAIYDHREADAQTFAATHNPWASQAG
CLSDVLYNTRHRERLGYNSKFYSPCAQYFNTEEIIAANKTLFKTI
DEYLLRAKDCIRGDTDTQYVCVEGTEQLIENPCRLTQEALPILST
TTLALMETKLKGGAGAFATSETHFGNYVVGEIIPLQQSMLFNS
8.
Data represent the entire sequence of the major capsid protein of Varicella-zoster virus (VZV)
>sp|P09245|MCP_VZVD Major capsid protein OS=Varicella-zoster virus (strain Dumas) GN=40 PE=3 SV=1
MTTVSCPANVITTTESDRIAGLFNIPAGIIPTGNVLSTIEVCAHRCI
FDFFKQIRSDDNSLYSAQFDILLGTYCNTLNFVRFLELGLSVACIC
TKFPELAYVRDGVIQFEVQQPMIARDGPHPVDQPVHNYMVKRIH
KRSLSAAFAIASEALSLLSNTYVDGTEIDSSLRIRAIQQMARNLRT
VLDSFERGTADQLLGVLLEKAPPLSLLSPINKFQPEGHLNRVARA
ALLSDLKRRVCADMFFMTRHAREPRLISAYLSDMVSCTQPSVM
VSRITHTNTRGRQVDGVLVTTATLKRQLLQGILQIDDTAADVPV
TYGEMVLQGTNLVTALVMGKAVRGMDDVARHLLDITDPNTLNI
PSIPPQSNSDSTTAGLPVNARVPADLVIVGDKLVFLEALERRVYQ
ATRVAYPLIGNIDITFIMPMGVFQANSMDRYTRHAGDFSTVSEQ
DPRQFPPQGIFFYNKDGILTQLTLRDAMGTICHSSLLDVEATLVA
LRQQHLDRQCYFGVYVAEGTEDTLDVQMGRFMETWADMMPH
HPHWVNEHLTILQFIAPSNPRLRFELNPAFDFFVAPGDVDLPGPQ
RPPEAMPTVNATLRIINGNIPVPLCPISFRDCRGTQLGLGRHTMTP
ATIKAVKDTFEDRAYPTIFYMLEAVIHGNERNFCALLRLLTQCIR
GYWEQSHRVAFVNNFHMLMYITTYLGNGELPEVCINIYRDLLQH
VRALRQTITDFTIQGEGHNGETSEALNNILTDDTFIAPILWDCDAL
IYRDEAARDRLPAIRVSGRNGYQALHFVDMAGHNFQRRDNVLI
HGRPVRGDTGQGIPITPHHDREWGILSKIYYYIVIPAFSRGSCCTM
GVRYDRLYPALQAVIVPEIPADEEAPTTPEDPRHPLHAHQLVPNS
LNVYFHNAHLTVDGDALLTLQELMGDMAERTTAILVSSAPDAG
AATATTRNMRIYDGALYHGLIMMAYQAYDETIATGTFFYPVPV
NPLFACPEHLASLRGMTNARRVLAKMVPPIPPFLGANHHATIRQP
VAYHVTHSKSDFNTLTYSLLGGYFKFTPISLTHQLRTGFHPGIAFT
VVRQDRFATEQLLYAERASESYFVGQIQVHHHDAIGGVNFTLTQ
PRAHVDLGVGYTAVCATAALRCPLTDMGNTAQNLFFSRGGVPM
LHDNVTESLRRITASGGRLNPTEPLPIFGGLRPATSAGIARGQASV
CEFVAMPVSTDLQYFRTACNPRGRASGMLYMGDRDADIEAIMF
DHTQSDVAYTDRATLNPWASQKHSYGDRLYNGTYNLTGASPIY
SPCFKFFTPAEVNTNCNTLDRLLMEAKAVASQSSTDTEYQFKRPP
GSTEMTQDPCGLFQEAYPPLCSSDAAMLRTAHAGETGADEVHL
AQYLIRDASPLRGCLPLPR

4.7. Results and discussion

4.7.1. Pairwise global alignment among the entire sequences of herpesviruses family

First, we wrote MATLAB code to perform pairwise global alignment among the entire capsid protein sequences of the herpesvirus family.
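The pairwise global alignment step can be illustrated with a short Python sketch of the Needleman-Wunsch recurrence. This is not the MATLAB code used in the thesis. The function names `needleman_wunsch` and `percent_identity`, the linear gap penalty, and the simple match/mismatch scores are all assumptions chosen for brevity, whereas the thesis results were obtained with BLOSUM scoring matrices, so scores from this sketch are not comparable with those reported in Table 4.2.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The match/mismatch/gap values
    are illustrative assumptions, not the BLOSUM matrices of the thesis.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding the same residue in both rows."""
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)
```

Under these assumed scores, `needleman_wunsch("ACGT", "AGT")` aligns ACGT with A-GT for a score of 1, and `percent_identity` on that alignment gives 75.0. Identity is computed here over all alignment columns; other conventions (for example, dividing by the shorter sequence length) give different percentages.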
The results were as follows.

1) The results obtained by using the sequence dot plot matrix

[Figure 4.5: 28 dot plot panels, one for each pair of capsid protein sequences. In every panel the two axes give the residue positions in the two sequences being compared (ticks from 0 to 1200). Panel titles: HSV1 & HSV2; HSV1 & EBV; HSV1 & HHV6; HSV1 & HHV7; HSV1 & HHV8; HSV1 & VZV; HSV1 & HCMV; EBV & HHV6; HSV2 & EBV; EBV & HHV8; EBV & HHV7; VZV & EBV; EBV & HCMV; HSV2 & HHV6; HSV2 & HHV7; HSV2 & HHV8; HSV2 & HCMV; HHV6 & HHV7; HHV6 & HHV8; HHV6 & VZV; HHV6 & HCMV; HHV7 & HHV8; HHV7 & VZV; HHV7 & HCMV; HHV8 & VZV; HHV8 & HCMV; HSV2 & VZV; VZV & HCMV.]

Figure: 4.5. Results of sequence dot plot matrix among the entire sequences of their capsid protein.

The results obtained in Figure 4.5 indicate that:
1. The capsid protein of herpes simplex virus 1 is more similar to the capsid protein of herpes simplex virus 2 and also to the capsid protein of varicella-zoster virus.
2. The capsid protein of herpes simplex virus 2 is more similar to the capsid protein of varicella-zoster virus.
3. The capsid protein of Epstein-Barr virus is more similar to the capsid protein of human herpesvirus 8.
4.
The capsid protein of human herpesvirus 6 is more similar to the capsid protein of human herpesvirus 7 and also to the capsid protein of human cytomegalovirus.
5. The capsid protein of human cytomegalovirus is more similar to the capsid protein of human herpesvirus 7.

2) The results obtained by using the Needleman-Wunsch algorithm

In our work we use the following scoring matrices:
1. BLOSUM50
2. BLOSUM30

Herpesviruses family  Score (BLOSUM50)  Identities  Score (BLOSUM30)  Identities
HSV1 & HSV2           2940.33           94%         1848.8            94%
HSV1 & EBV            537.333           29%         423.2             28%
HSV1 & HHV6           503               26%         392               26%
HSV1 & HHV7           480               25%         373               25%
HSV1 & HHV8           573.667           31%         438.8             30%
HSV1 & VZV            1641.67           52%         1061.4            52%
HSV1 & HCMV           468.667           28%         375.4             27%
EBV & HSV2            541.667           28%         431.4             28%
EBV & HHV6            747.667           31%         533.8             31%
EBV & HHV7            726.667           32%         517.4             31%
EBV & HHV8            1780              56%         1128.6            56%
EBV & VZV             551.667           28%         412.8             27%
EBV & HCMV            722.667           30%         507.8             30%
HSV2 & HHV6           493.667           26%         392.2             26%
HSV2 & HHV7           472               24%         372.8             25%
HSV2 & HHV8           568               30%         436.6             30%
HSV2 & VZV            1635              52%         1060              52%
HSV2 & HCMV           470.333           27%         377               27%
HHV6 & HHV7           2182.67           68%         1371.4            68%
HHV6 & HHV8           707               30%         500.2             30%
HHV6 & VZV            466.667           26%         371.4             26%
HHV6 & HCMV           1400.33           44%         899.8             44%
HHV7 & HHV8           710.667           31%         498               31%
HHV7 & VZV            458               27%         359               26%
HHV7 & HCMV           1395.33           43%         893               44%
HHV8 & VZV            555               29%         412.4             29%
HHV8 & HCMV           711.333           30%         506               30%
VZV & HCMV            432.667           26%         347.2             26%

(VZV = Varicella-Zoster Virus; HCMV = Human Cytomegalovirus.)

Table 4.2. The score and the percentage of identical residues (identities) of the pairwise global alignment among the entire capsid protein sequences of the herpesvirus family.

The results in this table confirm those shown in Figure 4.5.

4.7.2. Multiple sequence alignment (MSA) among the entire sequences of the capsid protein of herpesvirus family.
Second, we wrote MATLAB code to create a multiple sequence alignment (MSA) of all the entire capsid protein sequences. The MSA is used to confirm and refine the results obtained in Table 4.2. The results are illustrated with phylogenetic trees. In our work we use the following scoring matrices:
1. BLOSUM60
2. BLOSUM30
3. PAM10

a) Result obtained by using the scoring matrix BLOSUM60

[Phylogenetic tree titled "MSA of herpesvirus family by using scoring matrix blosum60". Leaf labels, in order: EBV, HHV8, HHV6, HHV7, HCMV, HSV2, HSV1, VZV. Internal nodes: Branch 1 to Branch 6 and Root. Horizontal axis: 0 to 1.6.]

Figure: 4.6. Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix BLOSUM60

b) Result obtained by using the scoring matrix BLOSUM30

[Phylogenetic tree titled "MSA of herpesvirus family by using scoring matrix blosum30". Leaf labels, in order: EBV, HHV8, HHV6, HHV7, HCMV, HSV2, HSV1, VZV. Internal nodes: Branch 1 to Branch 6 and Root. Horizontal axis: 0 to 1.6.]

Figure: 4.7. Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix BLOSUM30

c) Result obtained by using the scoring matrix PAM10

[Phylogenetic tree titled "MSA of herpesvirus family by using scoring matrix PAM10". Leaf labels, in order: HHV8, EBV, HHV6, HHV7, HCMV, HSV2, HSV1, VZV. Internal nodes: Branch 1 to Branch 6 and Root. Horizontal axis: 0 to 1.6.]

Figure: 4.8. Results of MSA among the entire sequences of herpesvirus family by using the scoring matrix PAM10

4.8. Conclusion

From the results in Figures 4.5 to 4.8 and Table 4.2, we can classify the herpesvirus family, according to the similarity of the structure of their capsid protein, into three categories as follows:
1. The first category contains
a) Herpes simplex virus 1 (HSV1)
b) Herpes simplex virus 2 (HSV2)
c) Varicella-zoster virus, VZV (HHV3)
2.
The second category contains
a) Human herpesvirus 8 (HHV8)
b) Epstein-Barr virus (EBV)
3. The third category contains
a) Human herpesvirus 7 (HHV7)
b) Human herpesvirus 6 (HHV6)
c) Human cytomegalovirus, HCMV (HHV5)

Figure: 4.9. Classification of the herpesvirus family according to the structure of their capsid protein.

The differences in capsid gene activation and expression need to be studied in depth by analysing the promoter region of the gene [39]. In addition, the structural similarity and evolutionary relationship between this virus family and genomic retrotransposons, based on structure [40, 41] and activation [42, 43, 44, 45], are currently under investigation.

These results were accepted for publication under the title "Mathematical Modeling And Classification Of Viruses From Herpesvirus Family" in the International Journal of Computer Applications (IJCA).

CHAPTER FIVE
CONCLUSION AND OUTLOOK

5.1. Conclusion

The main contributions of this thesis are as follows:
I. The pairwise alignment comparison of HSF1 among the studied eukaryotic organisms (e.g., Human, Taurus, Danio rerio, Mouse, Plant (Arabidopsis), Yeast) shows that human and Taurus are the closest over the entire length of HSF1 when using the scoring matrices BLOSUM30 and BLOSUM50, as shown in Table 3.2. However, the sequence of the HSF1 conserved domain (HSF_DNA-bind) was more similar between human and mouse than between the other pairs, using the same BLOSUM matrices, as shown in Table 3.1. Similar results are obtained using multiple sequence alignment. As shown in Table 3.3, the results obtained with the Gene-Tracer algorithm confirmed that the conserved domain (HSF_DNA-bind) in mouse is the same as in human, and that the entire sequence of the HSF1 protein in human is the same as in Taurus; they also show clearly the related parts between the sequences.
II.
One important result is that, comparing the results in Table 3.1 and Table 3.2, for all the eukaryotic organisms we used the degree of similarity of the conserved domain (HSF_DNA-bind) is higher than the degree of similarity of the entire HSF1 protein sequence.
III. From the results in Figures 4.5 to 4.8 and Table 4.2, we can classify the herpesvirus family, according to the similarity of the structure of their capsid protein, into three categories as follows:
1. The first category contains
a) Herpes simplex virus 1 (HSV1)
b) Herpes simplex virus 2 (HSV2)
c) Varicella-zoster virus, VZV (HHV3)
2. The second category contains
a) Human herpesvirus 8 (HHV8)
b) Epstein-Barr virus (EBV)
3. The third category contains
a) Human herpesvirus 7 (HHV7)
b) Human herpesvirus 6 (HHV6)
c) Human cytomegalovirus, HCMV (HHV5)

Biologists have obtained the same classification, but according to criteria other than the similarity of the structure of the capsid protein.

5.2. Outlook

The genetic evolution that occurs among the different types of HSF in human (e.g., HSF1, HSF2, HSF4, and HSF5), and of HSF1 in eukaryotic organisms (e.g., Human, Taurus, Danio rerio, Mouse, Plant (Arabidopsis), Yeast), needs to be studied. Also, the differences in capsid gene activation and expression need to be studied in depth by analysing the promoter region of the gene. In addition, the structural similarity and evolutionary relationship between this virus family and genomic retrotransposons, based on structure and activation, are currently under investigation.

REFERENCES

[1] Angelov, S. P. "Pattern Discovery in Biological Data Set", Ph. D. Thesis, Pennsylvania University, 24-34, (2007).
[2] Wu, X. "Improving the Performance and Precision of Bioinformatics Algorithms", Ph. D. Thesis, University of Maryland, 1-13, (2008).
[3] Sinha, S. "Algorithms for Finding Regulatory Motifs in DNA Sequences", Ph. D.
Thesis, Department of Computer Science and Engineering, University of Washington, 2-16, (2002).
[4] Mona Singh, "Topics in computational molecular biology", lecture 2, September 22, (1999).
[5] Gusfield, D. "Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology", Cambridge University Press, (1997).
[6] Richard C. Deonier, Simon Tavaré, Michael S. Waterman, "Computational Genome Analysis: An Introduction", chapter 6, Springer, (2003).
[7] Nello Cristianini and Matthew W. Hahn, "Introduction to Computational Genomics: A Case Studies Approach", chapter 3, Cambridge University Press, (2006).
[8] M. O. Dayhoff, R. M. Schwartz, B. C. Orcutt, "A model of evolutionary change in proteins", in Atlas of Protein Sequence and Structure, chapter 22, National Biomedical Research Foundation, Washington, DC, pp. 345–358, (1978).
[9] S. Henikoff, J. G. Henikoff, "Amino acid substitution matrices from protein blocks", Proc. Natl. Acad. Sci. USA, Vol. 89, No. 22, pp. 10915-10919, (1992).
[10] S. B. Needleman, C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, Vol. 48, No. 3, pp. 443-453, (1970).
[11] X. Huang, K. M. Chao, "A generalized global alignment algorithm", Bioinformatics, Vol. 19, No. 2, pp. 228–233, (2003).
[12] R. A. Cartwright, "Ngila: global pairwise alignments with logarithmic and affine gap costs", Bioinformatics, Vol. 23, No. 11, pp. 1427–1428, (2007).
[13] T. F. Smith, M. S. Waterman, "Identification of common molecular subsequences", Journal of Molecular Biology, Vol. 147, pp. 195-197, (1981).
[14] M. Eissa, A. M. Alzohairy, H. Abobakr, I. Zidan, "Gene-Tracer: Algorithm Tracing Genes Modification from Ancestors through Offsprings", International Journal of Computer Applications (0975 – 8887), Volume 52, No. 19, August (2012).
[15] Thompson J. D., Higgins D. G., Gibson T.
J., "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice", Nucleic Acids Res., Vol. 22, pp. 4673–4680, (1994).
[16] Report "IUCN Red List of Threatened Species". Internet address: http://www.iucnredlist.org/about/summary-statistics. (2010).
[17] Maddison, D. R. and K.-S. Schulz (eds.) "The Tree of Life Web Project". Internet address: http://tolweb.org. (2007).
[18] S. B. Needleman, C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, Vol. 48, No. 3, pp. 443-453, (1970).
[19] Sorger et al. "Stress-induced oligomerization and chromosomal relocalization of heat-shock factor". Nature 353, 822-827 (31 October 1991); doi:10.1038/353822a0. (1991).
[20] Morimoto, R. I. "Regulation of the heat shock transcriptional response: cross talk between a family of heat shock factors, molecular chaperones, and negative regulators". Genes Dev. 12, 3788-3896. (1998).
[21] Nover, L. "Expression of heat shock genes in homologous and heterologous systems". Enzyme Microb. Technol. 9, 130–144. (1987).
[22] Yokotani N, Ichikawa T, Kondou Y, Matsui M, Hirochika H, Iwabuchi M, Oda K, "Expression of rice heat stress transcription factor OsHsfA2e enhances tolerance to environmental stresses in transgenic Arabidopsis". Planta. 1432-2048. (2007).
[23] Mohammed M. Saleh, Ahmed M. Alzohairy, Osama Abdo Mohamed, Gaber H. Alsayed, "A Comprehensive Study by Using Different Alignment Algorithms to Demonstrate the Genetic Evolution of Heat Shock Factor 1 (HSF1) in Different Eukaryotic Organisms". IRACST – Engineering Science and Technology: An International Journal (ESTIJ), ISSN: 2250-3498, Vol. 3, No. 2, Pages: 376-382, (2013).
[24] Amon, Wolfgang; Farrell (November 2004). "Reactivation of Epstein–Barr virus from latency". Reviews in Medical Virology 15 (3): 149–56. doi:10.1002/rmv.456. PMID 15546128.
Retrieved 28 May (2012).
[25] Benjamin D. C., Kander R. J., Volk W. A. "Essentials of Medical Microbiology", 4th edition. J.B. Lippincott Company, Philadelphia. (1991).
[26] Richard A. Harvey, Pamela C. Champe, Bruce D. Fisher, "Lippincott's Microbiology", 2nd Edition, Section 4, Chapter 25, Lippincott Williams & Wilkins (2007).
[27] Ryan KJ; Ray CG (editors) "Sherris Medical Microbiology" (4th ed.). McGraw Hill. ISBN 0-8385-8529-9. (2004).
[28] Mettenleiter et al. "Molecular Biology of Animal Herpesviruses". Animal Viruses: Molecular Biology. Caister Academic Press. ISBN 1-904455-22-0. [http://www.horizonpress.com/avir], (2008).
[29] Sandri-Goldin RM (editor). "Alpha Herpesviruses: Molecular and Cellular Biology". Caister Academic Press. ISBN 978-1-904455-09-7. (2006).
[30] Chayavichitsilp P, Buckwalter JV, Krakowski AC, Friedlander SF. "Herpes simplex". Pediatr Rev 30 (4): 119–29; quiz 130. doi:10.1542/pir.30-4-119. PMID 19339385. April (2009).
[31] In the United States, as many as 95% of adults between 35 and 40 years of age have been infected. National Center for Infectious Diseases. http://www.cdc.gov/ncidod/diseases/ebv.htm
[32] Staras SA, Dollard SC, Radford KW, Flanders WD, Pass RF, Cannon MJ (November 2006). "Seroprevalence of cytomegalovirus infection in the United States, 1988–1994". Clin. Infect. Dis. 43 (9): 1143–51. doi:10.1086/508173. PMID 17029132. Retrieved (2009).
[33] John Carter, Venetia Saunders. "Virology, Principles and Applications". John Wiley & Sons. ISBN 978-0-470-02386-0.
[34] Jay C. Brown, William W. Newcomb. "Herpesvirus Capsid Assembly: Insights from Structural Analysis". Current Opinion in Virology 1 (2): 142–149. (2011).
[35] Madigan M, Martinko J. "Brock Biology of Microorganisms" (11th ed.). Prentice Hall. ISBN 0-13-144329-1, (2006).
[36] N. Komarova, D. Wodarz, "ODE models for oncolytic virus dynamics", Journal of Theoretical Biology, 263, 530-543, (2010).
[37] Y Wang, JP Tian, J Wei. "Lytic cycle: A defining process in oncolytic virotherapy". Applied Mathematical Modelling, Vol. 37, Issue 8, Pages 5962–5978, 15 April (2013).
[38] Gaber H. Alsayed, Ahmed M. Alzohairy, Osama Abdo Mohamed, Mohamed M. Saleh. "Mathematical Modeling And Classification Of Viruses From Herpesvirus Family". Computers in Biology and Medicine: An International Journal, (2014).
[39] Alzohairy, A. Mansour, Margaret H. MacDonald, Benjamin F. Matthews. "The pJan25 vector series: An enhancement of the gateway-compatible vector pGWB533 for broader promoter testing applications". Plasmid, 69(3): 249-56, (2013).
[40] Alzohairy, A. Mansour, Gábor Gyulai, Jansen RK, A. Bahieldin. "Transposable Elements Domesticated and Neofunctionalized by Eukaryotic Genomes". Plasmid, 69: 1–15, (2013).
[41] Mansour, A. "Utilization of Genomic Retrotransposon as cladistic molecular markers". Journal of Cell and Molecular Biology. 7(1): 17-28, (2008).
[42] Mansour, A. "Epigenetic activation of Genomic Retrotransposon". Journal of Cell and Molecular Biology. 6(2): 99-107, (2007).
[43] Mansour, A. "Water Deficit Induction of Copia and Gypsy Genomic Retrotransposons". Plant Stress 3(1): 33-39, (2009).
[44] Alzohairy, A. Mansour, Mohamed A Yousef, Sherif S Edris, Balázs Kerti and Gábor Gyulai. "Detection of long terminal repeat (LTR) retrotransposons reactivation induced by in vitro environmental stresses in barley (Hordeum vulgare) via reverse transcription-quantitative polymerase chain reaction (RT-qPCR)". Life Science Journal 9(4): 5019-5026, (2012).
[45] Ahmed Alzohairy, J.S.M. Sabir, Gabor Gyulai, Rania Younis, Robert Jansen, Ahmed Bahieldin. "Environmental Stress Activation of Plant LTR-Retrotransposons". Functional Plant Biology. (In press 2014).