
Malicious Software Recognition Using Fuzzy-Genetic Classification
AnnaMaria Carminelli Gregori
Dipartimento di Elettrica, Elettronica e Informatica
Facoltà di Ingegneria
Università degli Studi di Trieste
carmin@units.it
Abstract
The exponential growth of the information available on the Internet and the Web makes
users more and more exposed to attacks by malicious code (malware) such as viruses,
worms and the like. Specialised antivirus products [2], [8] try to counter and unmask the
malware. The great majority of these products analyse code segments and compare
special strings against a continuously updated database. Without this update, or
while it is in progress, a failure can occur. This type of defence is called a
“static procedure”.
The aim of our research is to find a different approach. Our basic
idea is “playing in advance” to recognize the malware and to provide for its capture. We
build on the tool presented in [12], but the idea presented in [10] “of building up a
computer immune system inspiring by biological immune system” led us to focus
only on fuzzy-genetic algorithms [5], [6] for the identification, recognition and classification
of the malware.
1. Introduction
The widespread internetworking of computers makes computer security a very important field
for Internet users, who are more and more exposed to attacks by malicious code (malware)
such as viruses, worms and the like.
The starting point of this work is the curiosity to answer two questions:
• What does the malware do?
• How can we stop viruses, worms and the like?
The logical and intuitive answers can be:
• Viruses, worms and malware attack the “materia”, where “materia” = Hardware & Software of
the Computer System.
• To stop the malware we started from the idea of building a new kind of antivirus, able “to
play in advance” in order to recognize viruses, worms and the like and to provide for their
capture.
To achieve this aim we intended to build a “dynamic” control of the components most
exposed to the risk of attack in the Computer System. The control should report any
malfunction and/or overload of these components in the presence of a “strange” process. Our antivirus should
be able to capture the “strange” process in a sort of “black hole”, for example an endless closed
loop.
This way of acting is of a dynamic, behavioural type, as in [12]. It differs from the static
procedures, which attempt virus recognition by analysing code segments and comparing
special strings against a continuously updated database. This is how the great majority
of current specialised antivirus products work [2], [7].
In [12] we dealt with virus recognition through a “dynamic” control, using as a first tool an algorithm
based on the SOM neural network; a subsequent fuzzy-genetic algorithm helped us to test the results
obtained.
In this work we intended to use only fuzzy logic and genetic algorithms to train a software tool
to detect a malware infection in a computer [8], starting from a set of input parameters.
We then intended to present in depth a fuzzy-genetic algorithm [4], [5], [6], [12], based on
the fundamentals of the biological immune system [10] (also in agreement with [13]), and a derived dynamic
tool for the identification, recognition and classification of the malware. Here, instead, we present
only a model implementation of that tool.
Why only a model? Because of a paradox. In spite of the multitude of malware, capturing samples
proves difficult. When ordinary users find a virus, they panic and, forgetting to save it for a test, think
only of deleting it. As a result, we have only about ten
samples of malicious code. With this number, it is possible to talk only about a model implementation of the
conceived tool.
2. Problem Formulation
“All organizations and individuals with computers or networks connected to the Internet are
vulnerable to malicious code and viruses. … The protection (or Antivirus) is often ineffective
against the ‘new breed’ of malicious code (viruses, worm, and Trojans).”
These two sentences by Douglas Schweitzer [2] summarise the state of the art of the malware
problem.
As mentioned above, the great majority of antivirus products [1], [2], [8] rely on recognising the code
of infecting programs. In this work we instead analyse the “infection status” of the
Computer System.
2.1 The decomposition
The problem of recognising malware can be decomposed into the following sub-problems:
• Investigating the features and behaviour of viruses, worms and malware, and their
classification;
• Defining a metric on the basis of the static or dynamic relationships between the
Hardware/Software components of the analysed System;
• Using the metric to recognize viruses, worms and malware automatically and to provide for
their capture.
In this paper we address the first sub-problem, in order to estimate the infection status of a Computer System.
2.2 Infection status
To estimate the infection status of a Computer System we analysed, in a virtual environment, some characteristic
parameters of the System both in normal condition and when infected by malware.
System monitors (see Section 3) supplied us with these characteristic parameters, and
we obtained two sets of values, normal and infected.
How much does the first set differ from the second?
To evaluate this difference we intended to use a logic resembling human reasoning, based on Artificial
Intelligence and more specifically on a fuzzy-genetic algorithm [5], [6] with a DNA encoding
method [13].
2.3 Parameter’s choice
The fundamental logical basis of our tool is that the presence of
malware can be detected by analysing the “System’s status”. The “System’s status” appears to us as an immense
multitude of different parameters specific to an Operating System such as Windows 2000
or XP. Which ones should we choose?
There are several catalogues of parameters, and it is not easy to access all of them; we used a
largely complete catalogue of parameters. The choice was made using
empirical results and … our own intuition, always within the imposed limits (for example, not using the
LAN network). These constraints do not diminish the quality of the work, which was developed to provide
a useful tool for real applications.
Many sequences of experiments with different choices of parameters led us to find which of them
were more significant. As mentioned in [12], the parameters we found most interesting were:
reads and copies/sec; IP\Datagrams/sec; operations on data files/sec; TCP\active connections.
3. Monitoring System Properties and Data Collection
We worked with the Microsoft Windows 2000 Professional Operating System, whose possible infection we considered
of interest because of its wide diffusion. To monitor the System we used the freeware
advanced utilities provided by SysInternals [11]. They handle System monitoring and allowed
us to compare some characteristic parameters (see Section 2) of the System’s processes before and after the
malware inoculation.
The System monitors and their functions are listed below:
Diskmon: monitors hard disk activity;
Filemon: monitors file system activity;
Regmon: monitors registry activity;
Portmon: monitors serial and parallel port activity;
TDImon: monitors UDP and TCP activity;
TCPview: monitors open ports and the activity of the related TCP and UDP processes;
ListDLLs: displays the libraries used by the System (DOS);
Handle: displays all file pointers used by the System (DOS);
Netstat: displays the state of the System’s ports (DOS);
ProcessExplorer: displays all the active System processes and their execution.
With these tools we recorded the System’s parameters mentioned above every 15 seconds, both in a normal
environment and in one infected by malware after the virus inoculation. In every run
some hundreds of values of these parameters were processed and imported into a database together with their average
and standard deviation.
These values, normalized to the interval [0,1] using their maximum and minimum values, were processed off
line and became the input of the fuzzy-genetic algorithm in the training phase, as in the schema of fig. 1.
The resulting output is the DNA code, which is used in the test phase as input of the fuzzy-genetic algorithm together with
the System’s parameters under test. A dedicated piece of software serves both to test the fuzzy-genetic algorithm
on well-known values and to evaluate its output with different inputs.
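The preprocessing described above (average, standard deviation and min-max normalization to [0,1]) can be sketched as follows. This is a minimal illustration in Python, not the software used in the experiments; the function names and the sample readings are hypothetical, and mapping a constant series to 0 is an assumption made only to avoid a division by zero.

```python
from statistics import mean, stdev

def summarize(samples):
    """Return the average and the sample standard deviation of one parameter."""
    return mean(samples), stdev(samples)

def normalize(samples):
    """Min-max scale the samples into [0, 1] using their own min and max."""
    lo, hi = min(samples), max(samples)
    if hi == lo:
        # Assumption: a constant series carries no information, map it to 0.
        return [0.0 for _ in samples]
    return [(x - lo) / (hi - lo) for x in samples]

# Hypothetical readings of one parameter, one value every 15 seconds.
readings = [12.0, 30.0, 18.0, 24.0]
avg, sd = summarize(readings)
scaled = normalize(readings)
```

The scaled values, one list per monitored parameter, would then form the rows of the input data passed to the training phase.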
fig.1
4. Fuzzy-genetic classification
Genetic algorithms use techniques inspired by evolutionary biology, such as
inheritance, mutation, selection and crossover. They are computer simulations of a problem in which
a population of abstract representations (the “chromosomes”) of possible solutions (called individuals)
evolves iteratively toward better solutions.
In the context of malware, the “chromosomes” are the code of some characteristic parameters of
the System’s processes, and the output of our tool indicates how much the System is infected.
We intended to reach this goal with an adaptive approach based on fuzzy logic. The problem is therefore to find
the proper fuzzy rules to obtain the “System’s status” from the input data, that is, the n parameters.
It is thus necessary to use a rules network schema as in fig. 2, where the connections stand for
active fuzzy rules. The rules are identified by a binary weight vector w: its dimension is equal to the
number of connections (that is, 4n²), and its element w[i] = 1 only if connection i is active, otherwise w[i]
= 0. It represents a binary code of 4n² bits. The network was built so that changes in the
connections can produce all the logic functions of the input parameters (see Sections 2 and 3). To find
the best connections we used the typical iterative evolution of Genetic Algorithms (see Section 4.2).
4.1 Fuzzification
This process takes the real values of the characteristic parameters of the “System’s status”
and subdivides them according to their “evidence”, where the evidence quantifies the truth of each rule (how
much the value belongs to each rule).
Fig. 2 shows an example of fuzzification of the n input parameters (n=4) by means of two very
simple membership functions:
$\mu_{ALTO}(x) = x$;
$\mu_{BASSO}(x) = 1 - x$;
defined in the range [0,1].
There are thus two fuzzy sets, ALTO (high) and BASSO (low), to which each of the n parameters belongs
to some degree. The two functions supply 2n values that are used to implement, by means of their MIN and MAX, the
intersection and union of fuzzy sets.
fig.2
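As an illustration of the fuzzification step, the two membership functions can be written directly in code. The sketch below is in Python with hypothetical function names and example values; it only assumes that the input parameters are already normalized to [0,1].

```python
def fuzzify(params):
    """Map n parameters in [0, 1] to 2n membership values.

    For each parameter x the result holds mu_ALTO(x) = x followed by
    mu_BASSO(x) = 1 - x.
    """
    memberships = []
    for x in params:
        if not 0.0 <= x <= 1.0:
            raise ValueError("parameters must be normalized to [0, 1]")
        memberships.append(x)        # degree of membership in ALTO (high)
        memberships.append(1.0 - x)  # degree of membership in BASSO (low)
    return memberships

def fuzzy_and(a, b):
    """Intersection of two fuzzy evidences: their minimum."""
    return min(a, b)

def fuzzy_or(a, b):
    """Union of two fuzzy evidences: their maximum."""
    return max(a, b)

# Hypothetical normalized values of n = 4 parameters.
params = [0.25, 0.5, 1.0, 0.0]
mu = fuzzify(params)
```

With n = 4 the list `mu` holds the 2n = 8 membership values on which the fuzzy rules operate.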
4.2 Adaptive genetic algorithm and fitness
A Genetic Algorithm converges to a conclusion expressing, as evidence, how much the input belongs to a
specific status: in the malware case, to the infection status of the System. To obtain this result it is
necessary to compute the optimal connections of the network described above.
For this purpose we trained the network with some parameter sets from the System in normal status and
others from the System in infected status. We used:
• a matrix Gene with 4n² rows and N columns, where each column represents a possible genetic
code of the fuzzy network, that is, one possible individual among the N considered (in our case
N=100);
• a matrix Data containing the input parameters: each row holds a set of characteristic
parameters;
• a vector v denoting the situation of each set of parameters (each row of matrix Data): v[i] = 0
refers to a normal situation, v[i] = 1 to an infected one.
At the start Gene is initialized with a uniform random distribution of the values 0 and 1. The population then
evolves by means of the genetic algorithm, which works iteratively for the number of generations needed to reach the
requested convergence (in our case 150). In each generation, for each
individual j (column of Gene) and each row i of Data, the network’s answer $y_{i,j}$ is computed; it gives the
probability that the parameters of row i come from an infected System. The distance between this answer and the
value v[i] is the error made. So, as fitness function we chose, for each individual:
$f_j = \sum_i \left(1 - |y_{i,j} - v_i|\right)$
In each generation only the individuals with
$f_j > \text{threshold}$
survive to the next generation, and their code is saved in dna[j]. The threshold is then updated.
To improve convergence, a mutation of about 1% of the matrix Gene is performed, and the individuals
are randomly coupled using a cross-over point.
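The fitness computation and the threshold-based survival can be sketched as below. This is a minimal Python illustration with hypothetical names and toy values. The rule used to update the threshold is not specified in the text, so the threshold is simply passed in as a parameter, and the population is stored as a list of individuals (the columns of Gene), which is a choice of this sketch.

```python
import random

def init_population(bits, size, rng):
    """Create `size` individuals, each a list of `bits` values drawn from {0, 1}."""
    return [[rng.randint(0, 1) for _ in range(bits)] for _ in range(size)]

def fitness(answers, v):
    """Fitness of one individual: sum over data rows of 1 - |y_i - v_i|."""
    return sum(1.0 - abs(y - target) for y, target in zip(answers, v))

def survivors(fitnesses, threshold):
    """Indices of the individuals whose fitness exceeds the threshold."""
    return [j for j, f in enumerate(fitnesses) if f > threshold]

# Toy example: three data rows, the first normal and the other two infected.
v = [0, 1, 1]
answers_a = [0.0, 1.0, 0.5]   # hypothetical network answers of individual A
answers_b = [1.0, 0.0, 0.5]   # hypothetical network answers of individual B
scores = [fitness(answers_a, v), fitness(answers_b, v)]
alive = survivors(scores, threshold=1.0)

# Population sized as in the text: 4n² bits with n = 4, and N = 100 individuals.
population = init_population(bits=64, size=100, rng=random.Random(0))
```

An individual that answers every row correctly reaches a fitness equal to the number of data rows, which is the upper bound of the function.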
The evolution operates on the 2n membership values of the fuzzy sets and on the 4n² possible connections
identified by the vector w.
At this point 2n membership values are available, which
can be used to implement fuzzy rules. Each connection
is weighted by a binary value w_i, equal to 1 if the connection is active
and 0 if there is no connection. As seen previously, the union
of fuzzy sets is obtained by computing the maximum of the membership functions,
while the intersection is obtained by computing the minimum. Since every possible
logic function can be built as an intersection of weighted unions,
the system shown in the schema can implement
all fuzzy rules. Each fuzzy rule is therefore identified by a weight vector
w whose length equals the number of connections, which, as can be
easily verified from the figure, is 4n².
The program valuta.m uses this schema to compute the probability
that the computer is infected, given n input parameters and a binary vector
of 4n² components representing the rules.
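An intersection of weighted unions can be sketched as follows. This Python function is not a port of valuta.m, whose source is not shown here; the name `evaluate` is hypothetical. The layout of w is also an assumption: the 4n² bits are read as 2n union nodes, each with one connection to every one of the 2n membership values. How a union node with no active connection is treated is a further assumption (it is skipped, and a network with no active connection at all returns 0).

```python
def evaluate(params, w):
    """Evidence of infection for n normalized parameters and 4n² rule bits."""
    n = len(params)
    m = 2 * n
    if len(w) != m * m:
        raise ValueError("w must hold 4n² bits")
    # Fuzzification: mu_ALTO(x) = x and mu_BASSO(x) = 1 - x for each parameter.
    mu = []
    for x in params:
        mu.extend((x, 1.0 - x))
    unions = []
    for k in range(m):
        # Union node k: maximum over the membership values it is connected to.
        active = [mu[i] for i in range(m) if w[k * m + i] == 1]
        if active:
            unions.append(max(active))
    # Intersection of the unions: minimum (assumed 0 when nothing is connected).
    return min(unions) if unions else 0.0

# Rule "parameter 0 is ALTO AND parameter 1 is BASSO" for n = 2.
w = [0] * 16
w[0 * 4 + 0] = 1   # union node 0 takes mu_ALTO of parameter 0
w[1 * 4 + 3] = 1   # union node 1 takes mu_BASSO of parameter 1
high_evidence = evaluate([0.75, 0.25], w)
low_evidence = evaluate([0.25, 0.25], w)
```

Under these assumptions each choice of w selects one logic function of the ALTO/BASSO evidences, which is what the genetic algorithm searches over.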
The reasoning phase is followed by another phase, called defuzzification,
in which the conclusions are translated from evidences into real values.
The weight vector w of the fuzzy network is a binary code of 4n² bits. It is therefore
possible to apply a genetic algorithm to obtain the fuzzy rules that
best correlate the parameters of a system with the presence or absence
of viruses. To do this we start from several series of parameters for which it is known a
priori whether they were recorded in the presence of a virus or on an uninfected machine.
Each series i of parameters is associated with a value v_i = 0 if it refers to a
clean situation and v_i = 1 if it refers to an infected one. These data
are passed to the program genetico.m in an input data matrix.
The program works on a matrix of 4n² rows and
N columns. Each column is a possible code of the fuzzy network, so the matrix
represents a population of N individual fuzzy networks. Initially
the matrix is filled with a uniform random distribution of ones and
zeros. By means of the function valuta.m, the answer of the fuzzy network is computed for each
individual j and each row i of the data matrix. The distance between the
answer $y_{i,j}$ of the network and the expected value $v_i$ is the error made,
so the following fitness function was adopted:
$f_j = \sum_i \left(1 - |y_{i,j} - v_i|\right)$ (4)
In this way the individuals that make a smaller error survive. To obtain a
reasonably fast convergence, a mutation of 1% was applied within the matrix, and the
individuals were coupled at random using a cross-over point, as
described previously. At each generation the program determines which code has the best
fitness and, if it improves on the results of previous generations, saves it
in the variable dna directly in the workspace, so that the execution of the program can be stopped at
any moment without losing the result.
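The generation loop described above can be sketched as follows. This is a Python illustration, not the genetico.m program; all names are hypothetical. Two points are assumptions of the sketch: the survival threshold is set to the mean fitness of the current generation, and if no individual exceeds it the best individual alone is kept as parent. The network evaluator is passed in as a function, and a toy evaluator stands in for the fuzzy network.

```python
import random

def single_point_crossover(a, b, rng):
    """Swap the tails of two codes at one random cross-over point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def evolve(evaluate, data, v, bits, size=100, generations=150, rate=0.01, seed=0):
    """Return the best code found (dna) and its fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(bits)] for _ in range(size)]
    dna, best_fitness = None, float("-inf")
    for _ in range(generations):
        # Fitness of each individual: sum over data rows of 1 - |y - v|.
        scores = [sum(1.0 - abs(evaluate(row, ind) - target)
                      for row, target in zip(data, v))
                  for ind in population]
        top = max(range(size), key=scores.__getitem__)
        if scores[top] > best_fitness:
            # Keep the best code so far, so stopping early loses nothing.
            best_fitness, dna = scores[top], list(population[top])
        threshold = sum(scores) / size  # assumption: mean fitness as threshold
        parents = [ind for ind, s in zip(population, scores) if s > threshold]
        if not parents:
            parents = [population[top]]
        children = []
        while len(children) < size:
            a, b = rng.choice(parents), rng.choice(parents)
            c1, c2 = single_point_crossover(a, b, rng)
            children.extend((mutate(c1, rate, rng), mutate(c2, rate, rng)))
        population = children[:size]
    return dna, best_fitness

def toy_evaluate(row, individual):
    """Toy stand-in for the fuzzy network: only the first bit matters."""
    return row[0] if individual[0] == 1 else 1.0 - row[0]

data = [[0.0], [1.0]]   # one normal row and one infected row
v = [0, 1]
dna, best = evolve(toy_evaluate, data, v, bits=4, size=30, generations=20)
```

In a real run the toy evaluator would be replaced by the fuzzy network, with `bits` equal to 4n² and the values N = 100 and 150 generations reported in the text.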
…
5. Preliminary experimental results
…
6. Conclusion and Future works
The preliminary experimental results presented in this paper appear interesting for the recognition of
some infection statuses. They also reveal different types of viruses attacking the virtual machine.
Finally, a first comparative analysis of fuzzy logic versus neural network classification indicates that
fuzzy logic recognition can be more accurate.
Nevertheless, the sample of malware used needs to be enlarged. One
item of future work will be the search for a more significant sample. Another should consider the
choice and determination of the characteristic System parameters in several virtual environments, to
compare any differences between them.
We will then address the sub-problem of defining a metric on the basis of the dynamic relationships
between the Hardware/Software components of the analysed System. Using this metric we will create a
tool to recognize viruses, worms and malware automatically and to provide for their capture.
7. Bibliography
[1] David Harley, Robert Slade, Urs E. Gattiker: “Viruses Revealed”, Osborne McGraw-Hill,
2000
[2] Douglas Schweitzer: “Securing the Network from Malicious Code: A Complete Guide to
Defending the Network Against Viruses, Worms, and Trojans”, Wiley, 2002
[3] Jeffrey O. Kephart, Steve R. White: “Measuring and Modeling Computer Virus Prevalence”,
Proceedings of the 1993 IEEE
[4] José R. C. Piqueira, Betyna F. Navarro, Luiz H. Monteiro: “Epidemiological Model Applied
to Viruses in Computer Network”, Journal in Comp. Science 1(1):31-34, 2005
[5] F. Russo: “Fuzzy Model Fundamental”, Wiley Encyclopedia of Electrical and Electronics
Engineering (J.G. Webster ed.), John Wiley & Sons, vol. 8, 1999
[6] D.E. Goldberg: “Genetic Algorithms in Search, Optimization, and Machine Learning”,
Addison-Wesley, 1989
[7] T. Kohonen: “The self organising map(7)”, http://www.cis.huf.fi/research/
[8] P. Attivissimo: “Acchiappavirus”, http:/www.attivissimo.net/me-work/lavoro.htm
[9] A.M. Carminelli Gregori, R. Cobalti, G. Vercelli: “Intelligent Web Agents for Information
Retrieval and Classification”, Proc. of Intl. Conf. on Practical Application of Intelligent Agents
and Multi-Agent Technologies, PAAM-99, 1999
[10] Ya-jing Zhang: “A Novel Immune Detection Algorithm for Anomaly Detection”, Proceedings
of the International Symposium on Intelligent Control, Limassol, Cyprus, June 2005
[11] www.sysinternals.com
[12] A.M. Carminelli Gregori, M. Nolich, L. Camilotti, M. Pagot, R. Pelizzola, R. Dapretto, I. Golob,
A. Zorn: “Riconoscimento di virus informatici mediante l’utilizzo di una rete non supervisionata Som
e di un algoritmo fuzzy-genetico”, Atti del Congresso Aica, Udine, Ottobre 2005
1. Basic idea of the research on viruses: finding answers to the questions:
• What do viruses, worms and malicious code do?
• How can they be counterattacked?
Logical answers from an intuitive point of view:
• Viruses, worms and malicious code attack the “materia”, where “materia” =
the set of Hardware and Software parts of a computer System;
• The counterattack consists in inventing an antivirus able “to play in advance”,
where “playing in advance” can mean monitoring the behaviour (the dynamics)
of the components of the computer System most at risk. The control should report
any malfunction and/or overload of these components in the presence of some “strange”
process. The antivirus should be able to trap such a process in a sort of
“black hole” (for example a loop with no exit).
2. The way of acting described above is dynamic (behavioural), in contrast to the
static one, which recognises a virus through the definition of its static
characteristics.
3. The problem can in any case be decomposed into the following sub-problems:
• investigation of the behaviour and/or the characteristics of viruses;
• definition of a metric based on the static and/or dynamic relationships between the
software and hardware components of the System under examination;
• use of the metric to recognise and capture viruses.
This note deals with the first sub-problem, using a Fuzzy-Genetic algorithm.