Malicious Software Recognition Using Fuzzy-Genetic Classification

AnnaMaria Carminelli Gregori
Dipartimento di Elettrica, Elettronica e Informatica
Facoltà di Ingegneria
Università degli Studi di Trieste
carmin@units.it

Abstract

The exponentially growing amount of information available on the Internet and the Web makes users more and more exposed to attacks by malicious code (malware), that is viruses, worms and the like. Specialised antivirus products [2], [8] try to counter and unmask the malware. The great majority of these products analyse code segments and compare characteristic strings against a continuously updated database. Without this update, or while the update is in progress, a failure can occur. This type of defence is called a “static procedure”. The aim of our research is to find a different way of working. Our basic idea is “playing in advance”: recognising the malware and arranging for its capture. We build on the tool presented in [12], but the idea presented in [10] “of building up a computer immune system inspiring by biological immune system” led us to focus only on fuzzy-genetic algorithms [5], [6] for the identification, recognition and classification of malware.

1. Introduction

The widespread internetworking of computers makes computer security a very important field for Internet users, who are more and more exposed to attacks by malicious code (malware), that is viruses, worms and the like. The starting point of this work is the curiosity to answer two questions: What does malware do? How can we stop viruses, worms and the like? The logical and intuitive answers are the following. Viruses, worms and other malware attack the “materia”, where “materia” = Hardware & Software of the Computer System. To stop malware, we started with the idea of building a new kind of antivirus, able “to play in advance”, in order to recognise viruses, worms and the like and to arrange for their capture.
To achieve this aim we intended to implement a “dynamic” control of the components of the Computer System most exposed to the risk of attack. The control should report every malfunction and/or overload of these components in the presence of a “strange” process. Our antivirus should be able to capture the “strange” process in a sort of “black hole”, for example an endless closed loop. This way of acting is of a dynamic (behavioural) type, as in [12], and differs from the static procedures, which attempt to recognise a virus by analysing code segments and comparing characteristic strings against a continuously updated database. This is how the great majority of current specialised antivirus products work [2], [7]. In [12] we dealt with virus recognition through a “dynamic” control, using as a first tool an algorithm based on the SOM neural network; a subsequent fuzzy-genetic algorithm helped us to test the result obtained. In this work we intended to use only fuzzy logic and genetic algorithms to train a software tool that detects a malware infection in a computer [8], starting from a set of input parameters. We then intended to present in depth a fuzzy-genetic algorithm [4], [5], [6], [12], based on the fundamentals of the biological immune system [10] (also in agreement with [13]), and a derived dynamic tool for the identification, recognition and classification of malware. Instead, only a model implementation of that tool is presented here. Why only a model? Because of a paradox. In spite of the multitude of malware, it seems difficult to capture it. When an ordinary person finds a virus, they panic and, forgetting to save it for testing, think only of deleting it. As a result, we have only about ten samples of malicious code. With this number, it is possible to talk only about a model implementation of the conceived tool.

2. Problem Formulation

“All organizations and individuals with computers or networks connected to the Internet are vulnerable to malicious code and viruses. … The protection (or Antivirus) is often ineffective against the ‘new breed’ of malicious code (viruses, worm, and Trojans).” These two sentences by Douglas Schweitzer [2] summarise the state of the art of the malware problem. As mentioned above, the great majority of antivirus products [1], [2], [8] rely on recognising the code of infecting programs. In this work we use instead the analysis of the “infection status” of the Computer System.

2.1 The decomposition

The problem of recognising malware can be decomposed into the following sub-problems:
Investigating the features and behaviour of viruses, worms and malware, and their classification;
Defining a metric on the basis of static or dynamic relationships between the Hardware/Software components of the analysed System;
Using the metric to recognise viruses, worms and malware automatically and to arrange for their capture.
In this paper we address the first sub-problem, in order to estimate the infection status of a Computer System.

2.2 Infection status

To estimate the infection status of the Computer System we analysed, in a virtual environment, some characteristic parameters of the System both in normal conditions and when infected by malware. System monitors (described in Par. 3) supplied us with these characteristic parameters, and we obtained two sets of values, normal and infected. How much does the first set differ from the second? To assess this difference we intended to use a logic similar to human reasoning, based on Artificial Intelligence and more specifically on a fuzzy-genetic algorithm [5], [6] with the DNA encoding method [13].

2.3 Parameter’s choice

The fundamental and logical basis of our tool is that the presence of malware can be detected by analysing the “System’s status”.
The “System’s status” appears to us as an immense multitude of different parameters, specific and inherent to Operating Systems such as Windows 2000 and XP. Which to choose? There are several catalogues of parameters and it is not easy to access all of them; we used a largely complete catalogue of parameters. The choice was made using some empirical results and … our own sensitivity, always within the imposed limits (for example, not using the LAN). These constraints do not diminish the quality of the work, which was developed to provide a useful tool for real applications. Many sequences of experiments with different choices of parameters led us to find which of them were more significant. As mentioned in [12], the parameters that appeared most interesting were: reading and copying/sec; IP\Datagrams/sec; Operations on data files/sec; TCP\active Connections.

3. Monitoring System Properties and Data Collection

We worked with the Microsoft Windows 2000 Professional Operating System, considering its possible infection to be of interest because of its wide diffusion. To monitor the System we used the freeware advanced utilities provided by SysInternals [11]. They monitor the System and allowed us to compare some characteristic parameters (as in Par. 2) of the System’s processes before and after the malware inoculation. The list of System monitors and their functions follows:
Diskmon: monitors hard disk activity;
Filemon: monitors file system activity;
Regmon: monitors registry activity;
Portmon: monitors serial and parallel port activity;
TDImon: monitors UDP and TCP activity;
TCPview: monitors open ports and the activity of the related TCP and UDP processes;
ListDLLs: displays the libraries used by the System (DOS);
Handle: displays all file pointers used by the System (DOS);
Netstat: displays the state of the System’s ports (DOS);
ProcessExplorer: displays all active System processes and their execution.
With these tools we recorded the System parameters mentioned above every 15 seconds, both in a normal environment and in one infected by malware after the virus inoculation. In each monitoring run some hundreds of values of these parameters were collected and imported into a database together with their average and standard deviation. These values, normalised to the interval [0,1] (using their max and min values), were processed off line and became the input of the fuzzy-genetic algorithm in the training phase, as in the schema of fig. 1. The resulting output is the DNA code, which is used in the test phase as input of the fuzzy-genetic algorithm together with the System parameters under test. The software is thus suited both to testing the fuzzy-genetic algorithm with well-known values and to evaluating its output with different inputs.

fig.1

4. Fuzzy-genetic classification

Genetic algorithms use techniques inspired by evolutionary biology, such as inheritance, mutation, selection and crossover. They are a computer simulation in which a population of abstract representations (the “chromosomes”) of possible solutions (called individuals) evolves iteratively toward better solutions. In the context of malware, the “chromosomes” are the encoding of some characteristic parameters of the System’s processes, and the output of our tool should indicate how much the system is infected. We intended to reach this goal with an adaptive approach based on fuzzy logic. The problem is therefore to find the proper fuzzy rules to obtain the “System’s status” from the input data, that is the n parameters. For this it is necessary to use a rule network schema as in fig. 2, where the connections stand for active fuzzy rules. The rules are identified by a binary weight vector w: its dimension is equal to the number of connections (that is $4n^2$), and its element w[i] = 1 only if connection i is active, otherwise w[i] = 0. It is thus a binary code of $4n^2$ bits.
The network was built so that changes in the connections can produce all the logic functions of the input parameters (as in Par. 2, 3). To find the best connections we used the typical iterative evolution of Genetic Algorithms (as in 4.2).

4.1 Fuzzyfication

This process takes the real values of the characteristic parameters of the “System’s status” and distributes them according to their “evidence”, where the evidence quantifies the truth of each rule (how much the value belongs to each rule). Fig. 2 shows an example of Fuzzyfication of the n input parameters (n=4) by means of two very simple functions: $\mu_{\mathrm{ALTO}}(x) = x$; $\mu_{\mathrm{BASSO}}(x) = 1 - x$; defined in the range [0,1]. There are thus two fuzzy sets, ALTO (high) and BASSO (low), shared by the parameters, and the two functions supply 2n values. These values are used to implement, by means of MIN and MAX, the intersection and the union of fuzzy sets.

fig.2

4.2 Adaptive genetic algorithm and fitness

A Genetic Algorithm tends to a conclusion that expresses, as evidence, how much the input belongs to a specific status: in the malware case, the infection status of the System. To obtain this result it is necessary to compute the optimal connections of the network described above. To this end we trained the network with some parameter sets of the System in normal status and others of the System in infected status. We used:
a matrix Gene with $4n^2$ rows and N columns, where each column represents a possible genetic code of the fuzzy network, that is one possible individual among the N considered (in our case N=100);
a matrix Data containing the input parameters: each row holds one set of characteristic parameters;
a vector v denoting the situation of each set of parameters (row of matrix Data): v[i] = 0 refers to the normal situation, v[i] = 1 to the infected one.
At the start, Gene is initialised with a uniform random distribution of the values 0 and 1.
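As an illustration of Par. 4.1 and of the initialisation just described, the following Python sketch shows one possible way to normalise raw readings into [0,1], fuzzify them with the ALTO and BASSO functions, and create the initial random population. It is only a sketch under stated assumptions: the function names (normalize, fuzzify, fuzzy_intersection, fuzzy_union, init_population), the treatment of a constant series, and the storage of Gene as a list of N individuals (the transpose of the $4n^2$ × N matrix) are hypothetical choices made for the example and are not taken from the original programs.

```python
import random


def normalize(values):
    """Min-max scale raw readings into [0, 1], as done before training."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant series carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


def fuzzify(params):
    """Return the 2n membership values of n parameters already in [0, 1]."""
    memberships = []
    for x in params:
        memberships.append(x)        # mu_ALTO(x) = x
        memberships.append(1.0 - x)  # mu_BASSO(x) = 1 - x
    return memberships


def fuzzy_intersection(a, b):
    """Intersection of two fuzzy memberships (MIN)."""
    return min(a, b)


def fuzzy_union(a, b):
    """Union of two fuzzy memberships (MAX)."""
    return max(a, b)


def init_population(n, n_individuals, rng):
    """Random 0/1 codes: one list of 4*n*n bits per individual."""
    length = 4 * n * n
    return [[rng.randint(0, 1) for _ in range(length)]
            for _ in range(n_individuals)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for repeatability of the example
    gene = init_population(n=4, n_individuals=100, rng=rng)
    print(len(gene), len(gene[0]))  # number of individuals, bits per individual
    print(fuzzify(normalize([10, 20, 30, 50])))
```

With n=4 and N=100, as in the text, each individual holds 64 bits. Keeping the random generator as an explicit argument makes a run repeatable, which is convenient when only a small sample of malware is available.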
This population evolves by means of the genetic algorithm, which works iteratively for the number of generations needed to reach the requested convergence (in our case 150). In each generation, for each individual j and each row i of Data, the network’s answer $y_{i,j}$ is computed; it gives the probability that the parameters correspond to a virus. The distance between this answer and the value $v_i$ (that is, v[i]) is the error made. As fitness function we therefore chose, for each individual:

$f_j = \sum_i \left(1 - |y_{i,j} - v_i|\right)$

In each generation only the individuals with $f_j$ > threshold survive to the next generation, and their code is saved in dna[j]. The threshold is then updated. To improve the convergence, a mutation of about 1% is applied to the matrix Gene and the individuals are randomly coupled using a cross-over point. The evolution can draw on the 2n values of the fuzzy sets and on the $4n^2$ possible connections identified by the vector w.
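The fitness function and one generation of the evolution described above can be sketched in Python as follows. This is a hedged illustration, not the original program: the function names are hypothetical, the threshold is assumed to be the mean fitness of the current generation (the text does not say how the threshold is updated), ties with the threshold are kept so that the population never becomes empty, survivors are copied unchanged, and mutation is applied only to the newly generated children.

```python
import random


def fitness(responses, v):
    """f_j = sum_i (1 - |y_ij - v_i|) for one individual j."""
    return sum(1.0 - abs(y - target) for y, target in zip(responses, v))


def next_generation(population, scores, rng, mutation_rate=0.01):
    """Build the next population from the current one and its fitness scores."""
    # Assumption: the threshold is the mean fitness of the current generation.
    threshold = sum(scores) / len(scores)
    # ">=" instead of ">" so that the best individual always survives.
    survivors = [ind[:] for ind, s in zip(population, scores) if s >= threshold]
    children = []
    while len(survivors) + len(children) < len(population):
        # Random coupling of two surviving individuals.
        a, b = rng.choice(survivors), rng.choice(survivors)
        # One cross-over point, chosen strictly inside the code.
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]
        # Flip each bit with probability mutation_rate (about 1%).
        child = [bit ^ 1 if rng.random() < mutation_rate else bit
                 for bit in child]
        children.append(child)
    return survivors + children


if __name__ == "__main__":
    v = [1, 0, 1]  # 1 = infected, 0 = normal
    print(fitness([0.9, 0.2, 0.6], v))  # about 2.3 out of a maximum of 3
    population = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
    scores = [3.0, 1.0, 2.0, 0.0]
    print(next_generation(population, scores, random.Random(1)))
```

The maximum of $f_j$ equals the number of rows of Data and is reached when every answer coincides with its expected value, so a larger fitness means a smaller total error.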
At this point 2n membership values are available, which can be used to implement fuzzy rules. Each connection is weighted by a binary value $w_i$, which is 1 if the connection is active and 0 if there is no connection. As seen previously, the union of fuzzy sets is obtained by computing the maximum of the membership functions, while the intersection is obtained by computing the minimum. Since every possible logic function can be built as an intersection of weighted unions, the system shown in the schema can implement all fuzzy rules. Each fuzzy rule is therefore identified by a weight vector w whose length equals the number of connections which, as can easily be verified from the figure, is $4n^2$. The program valuta.m uses this schema to compute the probability that the computer is infected, given n input parameters and a binary vector of $4n^2$ components representing the rules. The reasoning phase is followed by another phase, called defuzzification, in which the conclusions are translated from evidences into real values.

The weight vector w of the fuzzy network is a binary code of $4n^2$ bits. It is therefore possible to apply a genetic algorithm to obtain the fuzzy rules that best correlate the parameters of a system with the presence or absence of viruses. To do so, we start from several series of parameters for which it is known a priori whether they were recorded in the presence of a virus or on an uninfected machine. Each series i of parameters is assigned a value $v_i$ = 0 if it refers to a clean situation and $v_i$ = 1 if it refers to an infected one. These data are passed to the program genetico.m in an input data matrix.
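The evaluation scheme described above (the role played by valuta.m) could be sketched in Python as follows. The exact wiring of the network is shown only in the figure, so the layout used here is an assumption: 2n union nodes, each connected to all 2n membership values, which gives the stated $4n^2$ connections, followed by a single intersection. The function name evaluate_network is hypothetical, a union node with no active connection is taken to output 0, and the result of the intersection is returned directly as the degree of infection, without a separate defuzzification step.

```python
def evaluate_network(params, w):
    """Degree of infection in [0, 1] for n parameters and a 4*n*n bit code."""
    n = len(params)
    size = 2 * n
    if len(w) != size * size:
        raise ValueError("w must contain 4*n*n bits")
    # Fuzzification: ALTO (x) and BASSO (1 - x) for every parameter.
    memberships = []
    for x in params:
        memberships.extend((x, 1.0 - x))
    # Assumed layout: 2n union nodes, each connected to all 2n memberships.
    unions = []
    for k in range(size):
        row = w[k * size:(k + 1) * size]
        # Weighted union = MAX over the active connections (0 if none is active).
        unions.append(max(bit * m for bit, m in zip(row, memberships)))
    # Intersection of the unions = MIN.
    return min(unions)


if __name__ == "__main__":
    # One parameter (n = 1): memberships are ALTO = 0.8 and BASSO = 0.2.
    print(evaluate_network([0.8], [1, 0, 1, 0]))  # both unions use ALTO only
    print(evaluate_network([0.8], [1, 0, 0, 1]))  # ALTO AND BASSO
```

Under this reading each individual of the population is one vector w, and the genetic algorithm searches for the w whose answers are closest to the known values $v_i$.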
The program works on a matrix of $4n^2$ rows and N columns. Each column is a possible code of the fuzzy network, so the matrix represents a population of N individuals, each a fuzzy network. Initially the matrix is filled with a uniform random distribution of ones and zeros. By means of the function valuta.m, the response of the fuzzy network is computed for each individual j and each row i of the data matrix. The distance between the response $y_{i,j}$ of the network and the expected value $v_i$ represents the error made, so the following fitness function was adopted:

$f_j = \sum_i \left(1 - |y_{i,j} - v_i|\right)$   (4)

In this way the individuals that make a smaller error survive. To obtain a reasonably fast convergence, a mutation of 1% was introduced in the matrix; in addition, the individuals were coupled at random using one cross-over point, as described previously. At each generation the program determines which code has the best fitness and, if it performs better than those of the previous generations, saves it in the variable dna directly in the workspace, so that the execution of the program can be stopped at any moment without losing the result.

…

5. Preliminary experimental results

…

6. Conclusion and Future works

The preliminary experimental results presented in this paper appear interesting for the recognition of some infection statuses. They also show different types of viruses attacking the virtual machine. Finally, a first comparative analysis of fuzzy-logic versus neural-network classification indicates that fuzzy-logic recognition can be more accurate. Nevertheless, the sample of malware used needs to be enlarged. One of the future works will be the search for a more significant sample.
Another work should consider the choice and determination of the characteristic System parameters in several virtual environments, in order to compare any differences between them across environments. At that point we will address the sub-problem of defining a metric on the basis of dynamic relationships between the Hardware/Software components of the analysed System. Using this metric we will create a tool to recognise viruses, worms and malware automatically and to arrange for their capture.

7. Bibliography

[1] David Harley, Robert Slade, Urs E. Gattiker: “Viruses Revealed”, Osborne McGraw-Hill, 2000
[2] Douglas Schweitzer: “Securing the Network from Malicious Code: A Complete Guide to Defending the Network Against Viruses, Worms, and Trojans”, Wiley, 2002
[3] Jeffrey O. Kephart, Steve R. White: “Measuring and Modeling computer virus prevalence”, Proceedings of the 1993 IEEE
[4] José R. C. Piqueira, Betyna F. Navarro, Luiz H. Monteiro: “Epidemiological Model Applied to Viruses in Computer Network”, Journal in Comp. Science 1(1):31-34, 2005
[5] F. Russo: “Fuzzy Model Fundamental”, Wiley Encyclopedia of Electrical and Electronics Engineering, (J.G. Webster ed.), John Wiley & Sons, vol. 8, 1999
[6] D.E. Goldberg: “Genetic Algorithms in Search, Optimisation and Machine Learning”, Addison-Wesley, 1989
[7] T. Kohonen: “The self organising map(7)”, http://www.cis.huf.fi/research/
[8] P. Attivissimo: “Acchiappavirus”, http:/www.attivissimo.net/me-work/lavoro.htm
[9] A.M. Carminelli Gregori, R. Cobalti, G. Vercelli: “Intelligent Web Agents for Information Retrieval and Classification”, Proc. of Intl. Conf. on Practical Application of Intelligent Agents and Multi-Agent Technologies, PAAM-99, 1999
[10] Ya-jing Zhang: “A Novel Immune Detection Algorithm for Anomaly Detection”, Proceedings of the International Symposium on Intelligent Control, Limassol, Cyprus, June 2005
[11] www.sysinternals.com
[12] A.M. Carminelli Gregori, M. Nolich, L. Camilotti, M. Pagot, R. Pelizzola, R. Dapretto, I. Golob, A. Zorn: “Riconoscimento di virus informatici mediante l’utilizzo di una rete non supervisionata Som e di un algoritmo fuzzy-genetico”, Atti del Congresso Aica, Udine, Ottobre 2005

1. Basic idea of the research on viruses: finding answers to the questions: What do viruses, worms and malicious code do? How can they be counter-attacked? Logical answers from an intuitive point of view: viruses, worms and malicious code attack the “materia”, where “materia” = the set of Hardware and Software parts of a computing System; the counter-attack consists in inventing an antivirus capable of “playing in advance”, where “playing in advance” can mean monitoring the behaviour (the dynamics) of the components of the computing System most at risk. The control should report any malfunction and/or overload of these components in the presence of some “strange” process. The antivirus should be able to trap that process in a sort of “black hole” (for example a loop with no exit).
2. The way of acting described above is dynamic (behavioural), as opposed to the static one, which recognises the Virus through the definition of its static characteristics.
3. The problem can in any case be decomposed into the following sub-problems: investigation of the behaviour and/or characteristics of Viruses; definition of a metric based on the static and/or dynamic relationships existing between the software and hardware components of the System under examination; use of the metric to recognise and capture Viruses. The present note deals with the first sub-problem using a Fuzzy-Genetic algorithm.