N° d'ordre : 2006telb0009

THÈSE

Présentée à

l'ÉCOLE NATIONALE SUPÉRIEURE DES TÉLÉCOMMUNICATIONS DE BRETAGNE

en habilitation conjointe avec l'université de RENNES 1

pour obtenir le grade de

DOCTEUR de l'ENST Bretagne

Mention : Informatique

par

Yacine BOUZIDA

APPLICATION DE L'ANALYSE EN COMPOSANTE PRINCIPALE POUR LA DÉTECTION D'INTRUSION ET DÉTECTION DE NOUVELLES ATTAQUES PAR APPRENTISSAGE SUPERVISÉ

Soutenue le 24 mars 2006 devant la commission d'examen

Composition du jury

Rapporteurs :
- Hervé Debar, Expert Senior et HDR, France Telecom, Caen
- Michaël Rusinowitch, Directeur de Recherche, LORIA-INRIA Lorraine, Nancy

Examinateurs :
- Refik Molva, Professeur, GET/EURECOM, Sophia Antipolis
- Baudouin Le Charlier, Professeur, Université catholique de Louvain, Belgique
- Frédéric Cuppens, Professeur, GET/ENST-Bretagne, Rennes
- Sylvain Gombault, Enseignant chercheur, GET/ENST-Bretagne, Rennes

Invité :
- Christophe Mangin, Ingénieur expert, Mitsubishi, Rennes

© 2006 Yacine Bouzida. Tous droits réservés.

Principal Component Analysis for Intrusion Detection and Supervised Learning for New Attack Detection

by

Yacine Bouzida

A dissertation submitted to the Graduate Faculty of the École Nationale Supérieure des Télécommunications de Bretagne for the degree of Doctor of Philosophy in Computer Science

Defense date: March 24th 2006

Committee:

- Frédéric Cuppens (Supervisor), Professor at ENST-Bretagne, Rennes
- Hervé Debar (Reporter), Senior Expert at France Telecom, Caen
- Sylvain Gombault (Co-supervisor), Associate Professor at ENST-Bretagne, Rennes
- Baudouin Le Charlier (Examiner), Professor at the Catholic University of Louvain, Belgium
- Christophe Mangin (Examiner), Expert Engineer at Mitsubishi, Rennes
- Refik Molva (Examiner), Professor at Eurecom, Sophia Antipolis
- Michaël Rusinowitch (Reporter), Research Director at LORIA-INRIA Lorraine, Nancy

© 2006 Yacine Bouzida. All rights reserved.

To the memory of my father who passed away 25 days after my dissertation defense

Résumé

La détection d'intrusion est un mécanisme essentiel pour la protection des systèmes d'information. En plus des méthodes préventives, les systèmes de détection d'intrusion sont largement déployés par les administrateurs de sécurité. Cette thèse présente deux applications différentes de l'analyse en composante principale pour la détection d'intrusion. Elle propose aussi une nouvelle approche basée sur l'apprentissage supervisé afin de détecter les nouvelles attaques. Dans la première application, l'analyse en composante principale est utilisée pour distinguer les comportements normaux des utilisateurs des comportements malveillants. Dans la seconde application, elle est utilisée comme une méthode de réduction avant d'appliquer les modèles de classification qui fournissent des signatures d'intrusion. Un autre résultat de cette thèse est l'amélioration des techniques d'apprentissage supervisé pour la détection de nouvelles attaques. Les résultats des différentes expérimentations basées sur l'analyse en composante principale et celles relatives à l'amélioration des techniques supervisées pour la détection d'intrusion sont présentés et discutés.

Mots-clés : détection d'intrusion, analyse en composante principale, apprentissage supervisé, nouvelles attaques.

Abstract

Intrusion detection is an essential mechanism to protect computer systems from many attacks. Besides classical preventive security tools, intrusion detection systems (IDS) are nowadays widely used by security administrators. This thesis presents two different applications of principal component analysis to intrusion detection. It also discusses a new approach for detecting new attacks using supervised learning techniques. First, principal component analysis is used as an anomaly detection technique for distinguishing between normal and abnormal user profiles. Second, it is used as a reduction technique before applying classification models that automatically produce intrusion rules for misuse detection. Finally, improvements of supervised learning techniques are discussed for detecting new attacks. The different results and experiments performed using principal component analysis and the enhanced supervised learning technique are thoroughly presented and discussed.

Keywords: intrusion detection, principal component analysis, supervised learning, new attacks.

Acknowledgments

The work presented in this thesis was carried out in the SERES project team at ENST Bretagne from October 2002 to November 2005.

I would like to thank my supervisor Frédéric Cuppens for his guidance during the past three years. Frédéric has been very patient in helping me develop my thesis topic, and I wish to express my gratitude for this. I would also like to thank Sylvain Gombault for co-supervising my thesis and for making my research at ENST Bretagne enjoyable. My participation in many research projects allowed me to discover another dimension of research, where cooperation with other academic and industrial teams led me to work in such a way that my research would benefit both communities.

Many thanks go to the committee members of my dissertation defense: Refik Molva, Christophe Mangin and the chairman Baudouin Le Charlier. I would like to thank Hervé Debar and Michaël Rusinowitch for having reviewed my dissertation and for their invaluable comments.

Many thanks go to Nora Cuppens for her advice and support since the first day she joined ENST Bretagne. Nora provided me with many technical ideas that permitted me to boost my research and work efficiently. I would also like to thank her for the various comments she made on the first drafts of my dissertation. Nora, thank you very much!

I thank Gilbert Martineau for admitting me to the Networks, Security and Multimedia department at ENST Bretagne and for providing me with wonderful support during these last four years. I would also like to thank Xavier Lagrange for his support, and Isabelle Moreau and Christophe Mangin for their interest in my research work and for offering me a post-doc position in the Security and Access team at Mitsubishi Electric.

I am thankful to all the members of the different projects I participated in, GET-DDoS and ACI-DADDi, for the various ideas we exchanged during the meetings. All the members of our SERES project team, Ahmed, Thierry, Joaquin, Céline and Tony, were sources of ideas and fun. I thank them for making my stay at ENST Bretagne productive and enjoyable.

This thesis would not have been possible without the support of my family. First and foremost, this thesis is dedicated to the memory of my father. He was always behind my steps, ever since my childhood. He encouraged me to choose the computer science field from the age of 14, when I started writing my first programs in high school. His spirit and advice will be with me forever. I also dedicate this thesis to my lovely mother, who has always encouraged me to make my best effort in order to realize my dreams. I thank my sisters and brothers for all their help and support. Whatever I do for them would never be enough to give back a fraction of what they did and still do for me.

Contents

Résumé
Abstract
Acknowledgments
Contents
Acronyms and Abbreviations
List of Figures
List of Tables

1 Introduction
1.1 Context statement and our approach
1.2 Thesis Contributions
1.3 Thesis outline

2 Different approaches of intrusion detection
2.1 Introduction
2.2 Anomaly based
2.2.1 DENNING intrusion detection model
2.2.2 Policy based
2.2.3 Immune systems
2.3 Misuse based
2.3.1 Signature based
2.3.2 Scenario based
2.4 Cooperative approaches
2.4.1 Alert aggregation and fusion
2.4.2 Alert correlation
2.4.3 Cooperative architectures in intrusion detection
2.5 Limitations of the current IDSs
2.6 Summary

3 EigenProfiles to Intrusion detection
3.1 Related work
3.1.1 SRI IDES statistical anomaly detection tool
3.1.2 Hyperview—a neural network component for intrusion detection
3.2 Principal component analysis
3.2.1 Introduction
3.2.2 Basic principles
3.3 The eigenprofiles approach
3.4 Different steps of the method
3.4.1 Initialization procedure
3.4.2 The detection and identification procedure
3.4.3 Summary of the eigenprofiles approach
3.5 Experimental results
3.5.1 A simple data set
3.5.2 A real data set experimentation
3.6 Summary

4 Eigenconnections and supervised techniques to intrusion detection
4.1 Data mining and knowledge data discovery
4.1.1 Different steps for the knowledge discovery process
4.1.2 Data mining algorithms for intrusion detection
4.2 Supervised classification models for intrusion detection
4.2.1 Classifier building task
4.2.2 Machine learning approaches
4.3 Nearest neighbor and decision trees for intrusion detection
4.3.1 Nearest neighbor classification
4.3.2 Classification by decision tree induction
4.4 Eigenconnection approach for intrusion detection
4.5 Experimental data sets
4.5.1 DARPA 98 data pre-processing
4.5.2 Discussions on the DARPA 98 and KDD 99 pre-processing
4.6 Experimental methodology and results
4.6.1 Nearest neighbor with/without PCA
4.6.2 Decision trees with/without PCA
4.7 Summary

5 Novel attacks detection
5.1 Related work
5.1.1 EBayes TCP—Adaptive Bayesian model
5.1.2 Unsupervised anomaly detection
5.1.3 Artificial anomalies for detecting unknown network intrusions
5.2 Motivation and problem statement
5.3 Backpropagation technique for intrusion detection
5.3.1 Background
5.3.2 Experimental methodology and results
5.4 Improving the decision trees for intrusion detection
5.4.1 Background
5.4.2 Improving the classification process
5.4.3 Experimental analysis of KDD99
5.5 Why KDD99 is not an appropriate transformation?
5.6 Other experiments of new attacks detection
5.7 Summary

6 Conclusion
6.1 Overview
6.2 Thesis contributions
6.3 Future work and open issues
6.4 Thesis summary

Bibliography

A Description of the attacks in the DARPA 98 intrusion detection data sets
B Intrusion and intrusion detection glossary

Acronyms and Abbreviations

AC: Aggregation and Correlation algorithm
AdOr-BAC: Administration of the Or-BAC model
BSM: Basic Security Module
C4.5: Decision Tree Induction Algorithm version 4.5
C4.5rules: A C4.5 companion for generating pruned rules
C5: Decision Tree Induction Algorithm version 5
CARDS: Coordinated Attack Response and Detection System
CART: Classification And Regression Trees
COAST: Computer Operations, Audit, and Security Technology
CPT: Cost Per Test or Conditional Probability Table
CRIM: Coopération et Reconnaissance d'Intentions Malveillantes
CR: Confusion Ratio
DAC: Discretionary Access Control
DARPA: Defense Advanced Research Projects Agency
DDoS: Distributed Denial of Service
DDT: Domain Definition Table
DIT: Domain Interaction Table
DoS: Denial of Service
DSL: Digital Subscriber Line
DT: Decision Trees
DTE: Domain and Type Enforcement
DTEL: Domain and Type Enforcement Language
EWMA: Exponentially Weighted Moving Average
FRR: Failure Reject Ratio
GASSATA: Genetic Algorithms for a Simplified Security Audit Trail Analysis
GCT: Granger Causality Test
IDES: Intrusion Detection Expert System
IDIOT: Intrusion Detection In Our Time
IDMEF: Intrusion Detection Message Exchange Format
IDS: Intrusion Detection System
ISP: Internet Service Provider
IS: Information System
IT: Information Technology
KDD: Knowledge Discovery in Databases
kNN: k Nearest Neighbors
LAMBDA: LAnguage to Model a Database for Detection of Attacks
LAN: Local Area Network
MAC: Mandatory Access Control
MADAM/ID: Mining Audit Data for Automated Models for Intrusion Detection
MIT: Massachusetts Institute of Technology
MS-SQL: Microsoft Structured Query Language
NIDS: Network Intrusion Detection System
NN: Nearest Neighbor
Or-BAC: Organization-Based Access Control
OS: Operating System
OTN: Option Tree Node
PCA: Principal Component Analysis
PSP: Percentage of Successful Prediction
R2L: Remote to Local
RBAC: Role-Based Access Control
RTN: Rule Tree Node
SATAN: Security Administrator Tool for Analyzing Networks
SIR: Successful Identification Ratio
SQL: Structured Query Language
SRI: Stanford Research Institute
SRR: Successful Reject Ratio
SSO: Site Security Officer
SVM: Support Vector Machines
U2R: User to Root
USTAT: UNIX State Transition Analysis Technique

List of Figures

2.1 The simple printer security policy specification with DTEL.
2.2 An OTN and RTN generic linked list structure.
2.3 State transition diagram using USTAT.
2.4 IDIOT — a Petri net intrusion scenario (from [Kumar & Spafford, 1994]).
2.5 Event scenario sparse matrix example.
2.6 Execution time with GASSATA and our model.
2.7 Intrusion objective: denial of service on an SQL server.
2.8 DDoS attack architecture [CERT, 1999].
2.9 DDoS attack correlation graph.
2.10 Modeling a DDoS scenario with LAMBDA.
2.11 General architecture of CRIM.
3.1 PCA basic principle.
3.2 Projection on a new axis (X1).
3.3 The general architecture of the eigenprofile approach.
3.4 The projection of the users' profiles onto the new eigenprofiles space.
4.1 Data mining as a step in the process of knowledge discovery.
4.2 Knowledge data discovery process for intrusion detection.
5.1 Bayesian tree example for IP Sweep.
5.2 EBayes TCP belief tree structure.
5.3 Normal connections as dense regions and attacks as outliers.
5.4 A backpropagation multilayer neural network.
5.5 PSP variation according to the number of neurons on the hidden layer.
5.6 PSP variation according to the considered number of iterations.
5.7 Classification process using a threshold.
5.8 PSP variation according to the considered threshold value.
5.9 Different classes PSP variation according to the considered threshold value.
5.10 Different classes ratios detected as a new class.
5.11 Ratios of the different attack classes detected as a normal class.

List of Tables

2.1 Printing the /etc/shadow by a malicious user.
2.2 The database for the sequence < O, R, M, M, O, G, M, C >.
2.3 Penetration scenario (from [Ilgun, 1993]).
2.4 Correspondence commands-system calls.
3.1 The different profiles generated by the different users.
3.2 Euclidean distances between the four classes (one from another).
3.3 Performance of the system in the first experiments.
3.4 Performance of the system in the second strategy.
4.1 The different attack types and their corresponding occurrence numbers in the training and test data sets, respectively.
4.2 Intrinsic features of the network connection records.
4.3 Domain knowledge content features of network connection records.
4.4 Time based traffic features of network connection records.
4.5 Confusion matrix obtained with the nearest neighbor algorithm on 125 coordinates.
4.6 Confusion matrix obtained with the nearest neighbor on 4 coordinates after performing PCA.
4.7 Confusion matrix relative to five classes using the C4.5 algorithm.
4.8 Confusion matrix relative to five classes using the C4.5 algorithm after data set projection onto two principal component axes.
4.9 Time and tree size with/without PCA.
4.10 The best prediction ratios obtained for each class when considering different principal components.
4.11 Confusion matrix relative to five classes using the C4.5 algorithm after data set projection onto 3, 4, 5, 6 or 7 principal component axes.
5.1 The cost per test matrix.
5.2 Confusion matrix when using the backpropagation technique with the best parameters.
5.3 Classification using the post-pruned rules.
5.4 Confusion matrix using the rules generated by the standard C4.5rules algorithm.
5.5 Confusion matrix when using the generated rules from the enhanced C4.5 algorithm.
5.6 Confusion matrix obtained using the enhanced C4.5 algorithm on the initial KDD99 learning database.
5.7 Confusion matrix relative to new R2L attacks using the enhanced C4.5 algorithm.
5.8 Confusion matrix relative to five classes using the rules generated by the enhanced C4.5 algorithm over the learning database of the second test.
5.9 Confusion matrix relative to new R2L attacks using the enhanced C4.5 algorithm.
5.10 snmpgetattack attack and normal connection similarity.

Chapter 1

Introduction

Recent advances in computers and computer networks have brought many facilities to modern society. The Internet has raised the connectivity of people and organizations to a level never reached before. Nowadays, the Internet is widely used in government, military and commercial institutions. Emerging protocols and new network architectures make it possible to share, consult, exchange and transfer information from any place in the world to any other one situated in a different country, continent, sea or ocean. New infrastructures, including DSL (Digital Subscriber Line) lines, have greatly facilitated the exchange of information between individuals. They have created community networks of interest that are strongly distributed, dynamic and "superimposed" on the current network infrastructures.

Despite this progress, current networks are becoming more complex and are designed for functionality, while security is not considered a main goal. For this reason, the Internet offers clients fast, easy and cheap communication mechanisms at the network level that provide best-effort service to the different implemented protocols. Packet loss, reordering or corruption is handled by the higher-level transport protocols and implemented at the end points: the sender and the receiver. With this design, malicious end users can easily violate the specified policy of the different protocols and act to damage the other party. Hence, the end-to-end protocols might be corrupted by any malicious end party, while the intermediate network cannot react against this policy violation because its role consists only in forwarding received packets to the receiving end point.

Once computers were introduced, automated tools had to become available to protect information from enemies and malicious users. Therefore, designing and implementing the best tools to protect computer systems, and the information system accordingly, is a task that should be undertaken by any entity (government, enterprise, family or individual) using any processing equipment, in particular when it is connected to a network.

The first works in information system security started in the beginning of the sixties and focused mainly on access control systems. One of the main goals of the National Computer Security Center was to encourage the widespread availability of trusted computer systems [Center, 1987]. Discretionary access control [Center, 1987], one of the primary access control models introduced to secure computers, consists in restricting access to objects according to the subjects and/or groups to which they belong. This access control is discretionary because a process or a user might pass the information to which he has access on to another process or user. Mandatory access control [Bell & LaPadula, 1976], on the other hand, is more restrictive than the DAC model, since protection decisions are not made by the object owner. These two access control models are seemingly inappropriate for modern society because many services and applications are becoming more complex. The need for a new access control model that may be used by commercial and government organizations is essential in a society where computers and computer networks are so heavily used. A new access control model called role-based access control (RBAC) [Ferraiolo & Kuhn, 1992] was introduced at the beginning of the nineties to meet the needs of securing current information systems. Although the above access control models might be useful in military or civil organizations, they remain restrictive because they apply only to specific applications or systems.

Recently, a new access control model called organization-based access control (Or-BAC) [AbouElKalam et al., 2003] was introduced. This model relies on first order logic and, particularly, on Datalog [Li & Mitchell, 2003]. It has four main principles that make it more interesting than other access control models. First, it provides an abstraction level that helps the specification of high-level organization policies regardless of the implementation choices. Second, it provides means to evolve in a dynamic environment by introducing the notion of contexts [Cuppens & Miège, 2003b]. In addition, by contrast to the other access control models that only express permissions, Or-BAC provides the notion of negative rules that explicitly specify the forbidden actions that should be taken into account by a modern access control model. Conflicts might occur, since opposite rules may be used to express the security policy of an organization; for this reason, conflict management is being investigated [Cuppens & Cuppens, 2005] to prevent conflicts between the different specified rules. Finally, an associated administrative model called AdOr-BAC [Cuppens & Miège, 2003a] is defined, providing special procedures to update the policy while keeping it adequate with the predefined system accuracy. Currently, Or-BAC is limited to the access control part of a security policy, but it is being extended to other areas of computer security, such as firewall configuration and reconfiguration, intrusion detection, intrusion prevention, and computer management.

Despite undeniable progress during the last three decades in computer security in general and in access control in particular, many enterprises still suffer from attacks that cost them colossal sums. Different access control mechanisms, including firewalls and other access control models, are in place, but they neither stop the devastating results of these attacks nor detect them. This situation is critical, since an attacker has at least three new computer vulnerabilities per day to exploit against any vulnerable information system. Preventive techniques are not sufficient because of the proliferation of new generation networks where new protocols are emerging. New vulnerabilities are then more probable and easy to exploit due to configuration, design or programming errors. Old weaknesses in operating systems, known three decades ago [One, 1989], are still present and exploitable, such as attacks that exploit buffer overflows. This presents a big threat to current operating systems and software. Recently, the Slammer worm [Moore et al., 2003] infected more than 100,000 vulnerable machines over the Internet in less than ten minutes. This attack exploits a weakness in MS-SQL servers where a buffer overflow is possible by sending a single packet to the corresponding service on the target vulnerable machine. The target machine then sends thousands of packets per second of the same type to other machines, hoping to infect as many machines as possible over the Internet.

What is intrusion detection?

An intrusion is defined by Heady et al. [Heady et al., 1990] as: "any set of actions that attempt to compromise the integrity, confidentiality, or availability of a resource". We refine this definition as follows: "An intrusion is a finite and non-empty set of actions, which may be carried out maliciously, to violate the security policy of an information system". Intrusion detection is a field that saw its first days in the beginning of the eighties. In order to implement effective intrusion detection systems, several hypotheses should be satisfied. Some of them are:

• The different events issued from any information processing in the monitored information system, whether it is a computer or a network, should be observable. In this case, analyzing them may assess the user, network, system or any other application activities.
• From the observed events, one can differentiate between what is abnormal (called intrusive or illegitimate events) and what is normal (called legitimate events).
• Normal activities should be distinct from abnormal ones, so that classification techniques may be used to distinguish between what should be considered normal and what should not. Appropriate counter-measures may be launched accordingly.
• The monitored information system should be under the control of the administrator or the site security officer (SSO). He may easily and efficiently launch counter-measures, automatically or manually, when necessary, in case the detection process reports that the system state is critical.

The most seminal report that should be cited is that of Anderson [Anderson, 1980], where the author divides the possible attackers of a computer system into four groups:

• "External penetrator": the external penetrator has gained access to a computer for which he is not a legitimate user. Anderson uses this definition to include users, e.g. employees of some organization, having physical access to the building that houses computer resources, even though they are not authorized to use them.
• "Masquerader": the masquerader, who can be both an external penetrator and an authorized user of the system, is a user who, having gained access to the system, attempts to use the authentication information of another user. This case is interesting since there is no direct way of differentiating between the legitimate user and the masquerader.
• "Misfeasor": the legitimate user can operate as a misfeasor: although he has legitimate access to privileged information, he abuses this privilege to violate the security policy of the installation.
• "Clandestine user": the clandestine user operates at a level below the normal auditing mechanisms, perhaps by accessing the system with supervisory privileges. Since there is little, if any, evidence of this type of intrusive activity, this class of penetrator is difficult to detect.

Anderson [Anderson, 1980] uses the term "threat" in this same sense and defines it to be the potential possibility of a deliberate unauthorized attempt to:

1. access information,
2. manipulate information, or
3. render a system unreliable or unusable.

Any definition of intrusion is imprecise, as security policy requirements do not always translate into a well-defined set of actions. Intrusion detection systems help computer systems prepare for and deal with attacks. They collect information from a variety of vantage points within computer systems and networks, and analyze this information for symptoms of security problems.

Researchers have developed two main approaches to intrusion detection: anomaly detection and misuse detection. The former refers to intrusions that can be detected via anomalous behavior and use of computer resources. For example, if user X only uses the computer from his office from 9 AM to 5 PM, activity on his account late at night is suspicious and might be an intrusion. Another user might use only text processing applications; a compiler used during his session might be considered abnormal. Anomaly detection attempts to quantify the usual or acceptable behavior and flags other, irregular behavior as potentially intrusive. In contrast, misuse detection refers to intrusions that follow well-defined attack patterns exploiting weaknesses in system and application software. Such patterns can be precisely written in advance. From this prior knowledge about bad or unacceptable patterns, this technique seeks to detect them directly, as opposed to anomaly detection, which seeks to detect the complement of normal behavior. This classification provides a grouping of intrusions based on their end effect and the method of carrying them out. Irrespective of how intrusions are classified, the main techniques for detecting them are the same: the statistical approach of anomaly detection, and the precise monitoring of well-known attacks in the misuse detection approach. Both approaches make implicit assumptions about the nature of the intrusions they can detect.

Detecting intrusions is not an easy task, since many skills, such as computer and engineering knowledge expertise, are needed to design and build an effective intrusion detection system (IDS). The different IDSs may be evaluated according to the ratios of false positives and false negatives they generate. In the following, we call a false positive an alert indicating that an attack has occurred when in reality this is not the case. A false negative, by contrast, corresponds to an attack that actually occurred but that the IDS did not detect. In the same manner, we call a true positive an alert that is raised for a real attack, and a true negative the absence of an alert for events that correspond in reality to normal traffic or behavior.

There have been many IDSs developed during the past three decades. However, most of the commercial and freeware IDS tools are signature based [Roesch, 1999], corresponding to one particular type of misuse detection. Such tools can only detect known attacks previously described by their corresponding signatures. The signature database has to be maintained and updated periodically and manually. In addition, the pattern matching techniques used by these signature-based IDSs generate high false positive rates, since they rely on string matching. They also generate high false negative rates, because current hackers know that the prevention techniques and IDSs in place are mostly signature based, so they craft new attacks that easily bypass the security mechanisms in place and, in particular, the deployed IDSs.
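To fix the terminology above, here is a toy computation of these ratios in Python; all counts are invented for the illustration and do not come from any experiment in this thesis:

```python
# Toy alert counts for a hypothetical IDS evaluation (illustrative only).
tp = 90    # true positives: alerts raised on real attacks
fp = 30    # false positives: alerts raised on normal traffic
fn = 10    # false negatives: attacks the IDS missed
tn = 870   # true negatives: normal events that raised no alert

false_positive_rate = fp / (fp + tn)   # fraction of normal events flagged
false_negative_rate = fn / (fn + tp)   # fraction of attacks missed
detection_rate = tp / (tp + fn)        # complement of the false negative rate

print(f"FPR = {false_positive_rate:.1%}")   # 3.3%
print(f"FNR = {false_negative_rate:.1%}")   # 10.0%
print(f"DR  = {detection_rate:.1%}")        # 90.0%
```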

1.1 Context statement and our approach

We motivate the thesis by discussing a number of open problems related to previous and current work in the field of intrusion detection.

Anomaly detection, the first technique introduced for establishing normal behaviors and then detecting any deviation from the current profile, was implemented by many researchers, who provided anomaly tools such as IDES [Javitz & Valdes, 1991], NIDES [Javitz & Valdes, 1993], Hyperview [Debar et al., 1992], etc. Many of these tools look at the different user profiles independently of one another and mostly use statistics over the different measures that characterize those profiles. This suggested to us that an information theory approach, coding and decoding user behaviors, may reveal new information content of user behaviors, emphasizing the significant features. We introduced a technique based on principal component analysis (PCA) that not only detects the deviation of a given user/application profile but also detects masqueraders who, for example, seize legitimate user passwords and penetrate the system for further malicious activities [Bouzida & Gombault, 2003b].

We consider intrusion detection as a knowledge discovery process and used the same multivariate analysis technique based on PCA for space reduction, which brings many benefits for misuse detection. The time consumed for learning intrusion signatures is reduced by a significant factor, whereas the true detection rate remains stable. We used the KDD99 [KDD Cup, 1999] data sets for our tests, where we demonstrated the effectiveness of our approach [Bouzida & Gombault, 2004] in combining a multivariate approach with a supervised classification technique for misuse detection using network traffic as a source of analysis. This testbed was provided by DARPA in 1999. It is the result of a transformation of simulated network traffic into organized connections for both anomaly and misuse detection.

The main problem with this off-line network traffic evaluation data set, collected from the simulated LAN, is the presence of artifacts that make it somewhat different from real traffic. While this raw tcpdump traffic was severely criticized by many researchers, particularly by McHugh in [McHugh, 2000], many other researchers continued to perform their experiments over this data set, and particularly over the transformed traffic. This last data set contains two classes of attacks that were never detected in their appropriate class and were always detected as normal traffic by anomaly detection tools. This situation led us to focus closely on this data set. We improved some supervised techniques in order to build a single module that includes both anomaly and misuse detection principles at the same time. We extended the notion of anomaly detection to consider not only normal behavior but also abnormal behavior. We applied the enhanced supervised techniques over the transformed data sets and outperformed all previous work done on these data sets. However, some attack types remain undetectable. As a consequence, we implemented a framework that transforms tcpdump traffic into connection records, since the tool (MADAM/ID [Lee et al., 1999, Lee & Stolfo, 2000, Lee, 1999]) that was developed for this task is unavailable. We found that much interesting information is lost during the transformation performed by MADAM/ID.

Thereafter, we set some necessary conditions that should be satisfied by a transformation tool in order to build effective intrusion detection models. We have criticized and shown why the different approaches applied over this transformed data did not perform well for some kinds of attacks. Applying new techniques over this data set will never give a true picture of the newly developed techniques, because of the information lost in the transformation. Future research on this data set should consider that it has many problems that make it less interesting because of that information loss.


1.2 Thesis Contributions

This dissertation presents answers to the following questions:

• How to model a user, an application, or network traffic in order to distinguish between normality and abnormality?
• Is it possible to reduce the space of the data that characterizes a user, an application or network traffic without information loss, and hence without increasing the false negative and false positive rates?
• How to improve machine learning techniques to take into account not only known intrusions but also unknown ones, in order to improve the current intrusion detection tools?

1.3 Thesis outline

The rest of the thesis is organized as follows. Chapter 2 examines in depth the different approaches to intrusion detection developed over the last decades and offers critiques and suggestions for some of these techniques. Chapter 3 describes our anomaly intrusion detection technique, which is based on principal component analysis; we introduced this technique for detecting abnormal user profiles and masqueraders. This chapter presents the different experiments we conducted over simulated users and web log data. Chapter 4 briefly describes the process of knowledge discovery, introduces our algorithm that uses principal component analysis as a reduction step in knowledge discovery for misuse detection, and presents the results we obtained over network traffic. Chapter 5 presents prior work on detecting new attacks, shows the limitations and drawbacks of these detection approaches, and presents the efforts we made over classification techniques in order to detect novel attacks that are not taken into account during the learning step. It then presents the results obtained using the improved techniques. Finally, Chapter 6 summarizes the thesis and outlines ideas for future work.

Chapter 2

Different approaches of intrusion detection

In this chapter, we present the main approaches from prior work in the field of intrusion detection. We investigate, in depth, the principal works that have been developed and implemented in laboratories and in industry. In the first part, we briefly describe a well-known work, that of Dorothy Denning [Denning, 1987], since we consider it the basic model of most current IDSs; it remains accurate even though it was proposed two decades ago. Then, we present the different techniques used in intrusion detection on the basis of the methods they use to detect anomalies in the monitored information system. We also describe in depth the alert correlation techniques that use the different alerts launched by cooperative IDSs in order to detect as many attacks as possible and then reduce the number of false positives and false negatives by aggregation. Recognizing attacker planning activity and intentions through alert correlation is also presented. Finally, the last section presents the main advantages and disadvantages of, and criticisms of, the different techniques.

2.1 Introduction

A complete survey of the research in the field of computer and network intrusion detection is beyond the scope of this dissertation, because there are currently more than one hundred intrusion detection systems. However, we try, in this chapter, to describe the approaches by reporting examples of systems that are based on them. There are two approaches to intrusion detection, as described in the introduction: anomaly detection and misuse detection. A hybrid approach, which uses these two at the same time, may also be considered a third approach to intrusion detection systems. Other techniques are presented in the following chapters, where we criticize them and compare our proposed methods to them. IDES [Javitz & Valdes, 1991] and Hyperview [Debar et al., 1992] are discussed in Chapter 3. Other anomaly detection tools for network traffic, such as EBayes TCP [Valdes & Skinner, 2000], and unsupervised techniques are discussed in Chapter 5.

2.2 Anomaly based

This approach consists in establishing a normal behavior profile for user and system activity or network traffic, and observing significant deviations of actual user activity or network traffic with respect to the established habitual normal profile. Significant deviations are flagged as anomalous and should raise suspicion. Ideally, the daily use of a computer system by a specific user follows a recognizable pattern, which can serve as a characterization of the user's identity. The set of application programs, commands, Internet browsing, and utilities invoked by a user is often determined by the nature of that user's job assignment. For instance, a secretary is expected to use a document processing program, e-mail, or calendar management applications. By contrast, a programmer almost always restricts himself to editing source files, compiling and testing programs, and so forth. Clearly, the fact that a compiler is suddenly invoked by a secretary should be considered highly suspicious and hence should be immediately brought to the attention of the security officer, who may conclude that an intruder is probably masquerading as the secretary. Moreover, even though a group of users may have the same job assignments, it is still possible to distinguish among their profiles based on their personal habits, such as their session location, normal working hours, or the set of commands they frequently use.

2.2.1 DENNING intrusion detection model

This model [Denning, 1987] has six main components:

• Subjects: the initiators of actions in the target system. A subject is typically a terminal user, but might also be a process acting on behalf of users or groups of users, or the system itself. All activities arise through commands initiated by subjects. Subjects may be grouped into different classes (e.g., user groups) for the purpose of controlling access to objects in the system. Groups of users may overlap.

• Objects: the targets of actions; they typically include entities such as files, programs, messages, records, terminals, and user- or program-created structures. When subjects can be recipients of actions (e.g., electronic mail), those subjects are also considered objects of the model. Objects are grouped into classes by type (program, text file, etc.). Additional structure may also be imposed; e.g., records may be grouped into files or database relations, and files may be grouped into directories. Different environments may require different object granularity; e.g., for some database applications, granularity at the record level may be desired, whereas for most applications granularity at the file or directory level may suffice.

• Audit records: generated by the target system in response to actions performed or attempted by subjects on objects. They are 6-tuples representing actions performed by subjects on objects and contain, besides the subject and the object:
1. Action: the operation performed by the subject on or with the object, e.g., login, logout, read, execute.
2. Exception-condition: denotes which, if any, exception condition is raised on the return. This should be the actual exception condition raised by the system, not just the apparent exception condition returned to the subject.
3. Resource-usage: a list of quantitative elements, where each element gives the amount used of some resource, e.g., number of lines or pages printed, number of records read or written, CPU time or I/O units used, session elapsed time.
4. Time-stamp: a unique time/date stamp identifying when the action took place.


• Profiles: an activity profile characterizes the behavior of a given subject (or set of subjects) with respect to a given object (or set thereof). Observed behavior is characterized in terms of a statistical metric and model. A metric is a random variable x representing a quantitative measure accumulated over a period. The period may be a fixed interval of time (minute, hour, day, week, etc.), or the time between two audit-related events (i.e., between login and logout, program initiation and program termination, file open and file close, etc.). Observations (sample points) xi of x obtained from the audit records are used together with a statistical model to determine whether a new observation is abnormal. The statistical model makes no assumptions about the underlying distribution of x; all knowledge about x is obtained from observations. Here are some possible variables which may characterize a profile:
1. number of password failures during a minute,
2. CPU time consumed by a program, measured by an interval timer that runs between program initiation and termination,
3. number of times some command is executed during a login session,
4. number of logins during an hour, etc.

• Anomaly records: these records are generated whenever abnormal behavior is detected. Each anomaly record contains three components: event, time-stamp and profile.

• Activity rules: an activity rule specifies an action to be taken when an audit record or an anomaly record is generated, or a time period ends. They have the common "condition-action" form. An example is with password failures, where the security officer should be notified immediately of a possible break-in attempt if the number of password failures on the system during some interval of time is abnormal.

The purpose of a statistical model is to determine, from n observations x1, x2, ..., xn of a random variable x, whether a new observation xn+1 is abnormal with respect to the previous observations. Reference [Denning, 1987] proposes several models:

1. Operational model: this model is based on the operational assumption that abnormality can be decided by comparing a new observation of x against fixed limits. Although the previous sample points for x are not used, presumably the limits are determined from prior observations of the same type of variable. The operational model is most applicable to metrics where experience has shown that certain values are frequently linked with intrusions. An example is an event counter for the number of password failures during a brief period, where more than 10, say, suggests an attempted break-in.

2. Mean and standard deviation model: this model is based on the assumption that all we know about x1, x2, ..., xn are the mean and standard deviation, determined from the first two moments:

    sum = x1 + x2 + ... + xn                        (2.1)
    sumsquares = x1^2 + x2^2 + ... + xn^2           (2.2)
    mean = sum / n                                  (2.3)
    stdev = sqrt( sumsquares / (n - 1) - mean^2 )   (2.4)

A new observation xn+1 is defined to be abnormal if it falls outside a confidence interval that is d standard deviations from the mean:

    mean ± d × stdev                                (2.5)

This model is applicable to event counters, interval timers, and resource measures accumulated over a fixed time interval or between two related events. It has two advantages over an operational model. First, it requires no prior knowledge about normal activity in order to set limits; instead, it learns what constitutes normal activity from its observations, and the confidence intervals automatically reflect this increased knowledge. Second, because the confidence intervals depend on observed data, what is considered normal for one user can be considerably abnormal for another. A slight variation on the mean and standard deviation model is to weight the computations, with greater weights placed on more recent values.

3. Multivariate model: this model is similar to the mean and standard deviation model except that it is based on correlations among two or more metrics. It would be useful if experimental data showed that better discriminating power can be obtained from combinations of related measures rather than from individual ones, e.g., CPU time and I/O units used by a program, login frequency, and session elapsed time.

4. Markov process model: this model, which applies only to event counters, regards each distinct type of event (audit record) as a state variable, and uses a state transition matrix to characterize the transition frequencies between states (rather than just the frequencies of the individual states, i.e., audit records, taken separately). A new observation is defined to be abnormal if its probability, as determined by the previous state and the transition matrix, is too low. This model might be useful for looking at transitions between certain commands where command sequences are important.

5. Time series model: this model, which uses an interval timer together with an event counter or resource measure, takes into account the order and inter-arrival times of the observations x1, x2, ..., xn as well as their values. A new observation is abnormal if its probability of occurring at that time is too low. A time series has the advantage of measuring trends of behavior over time and detecting gradual but significant shifts in behavior, but the disadvantage of being more costly than the mean and standard deviation model.

This theoretical model was proposed by Denning in the eighties and then implemented in the IDES (Intrusion Detection Expert System) [Denning & Neumann, 1985, Javitz & Valdes, 1991] tool at SRI. It has the privilege of being the first step in intrusion detection, and much of the research done in intrusion detection has been based on this model. IDES was one of the most significant of the early IDS research efforts. Profiles are considered as statistical models of subject behavior, and the system attempts to detect anomalous behaviors, i.e., those sufficiently unusual with respect to the profile to be considered suspect. Many other systems appeared after this model, mostly using the behavior profile idea for anomaly detection. However, at the beginning of the nineties, many research efforts focused on other directions to detect intrusions. Some of them are discussed and criticized in the following sections.
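As a concrete illustration of the mean and standard deviation model above, here is a minimal sketch in Python of equations (2.1) to (2.5); the metric, the history values and the function names are invented for the example and are not part of Denning's model:

```python
import math

def mean_stdev(observations):
    """Moment-based estimates, following equations (2.1)-(2.4)."""
    n = len(observations)
    total = sum(observations)                            # (2.1) sum
    sumsquares = sum(x * x for x in observations)        # (2.2) sumsquares
    mean = total / n                                     # (2.3) mean
    stdev = math.sqrt(sumsquares / (n - 1) - mean ** 2)  # (2.4) stdev
    return mean, stdev

def is_abnormal(x_new, observations, d=2.0):
    """Flag x_new when it falls outside mean +/- d * stdev, as in (2.5)."""
    mean, stdev = mean_stdev(observations)
    return abs(x_new - mean) > d * stdev

# Example metric: number of logins per hour for one subject.
history = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
print(is_abnormal(4, history))   # False: inside the confidence interval
print(is_abnormal(25, history))  # True: flagged as anomalous
```

Note that the model learns its limits from the observations themselves: the same call with another user's history would yield a different confidence interval, which is exactly the second advantage mentioned above.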

2.2.2 Policy based

Defining a security policy is not sufficient to ensure normal behavior. A security policy might be violated inadvertently (because of programming or configuration errors) or intentionally (due to malicious actions). Therefore, it is necessary to ensure that this policy is respected and that any violation is detected. Current policy-based techniques and tools may be categorized into prevention techniques (such as access control using firewalls, for example) and detection techniques and tools that announce any violation of the considered policy. In this section, two different methods that allow the enforcement of the security policy of an operating system such as UNIX are presented. The first is based on a reference control flow model [Zimmermann et al., 2002] that is quite similar to the DTE (Domain and Type Enforcement) introduced in the mid-eighties. The DTE, on the other hand, is the second method discussed in this section. In fact, domain and type enforcement (DTE) was first introduced by Boebert et al. [Boebert & Kain, 1985] and Thomsen [Thomsen, 1990, Thomsen, 1995] for security policy enforcement, and was then followed by many applications to the UNIX operating system [Badger et al., 1995, Badger et al., 1996], firewalls [Oostandorp et al., 2000], etc. At the end of the section, a comparative study of these two techniques is drawn.

Reference flow control model

In [Zimmermann et al., 2002], Zimmermann et al. proposed a host-based method to detect anomalies based on the definition of the security policy. This technique is deployed to detect symptoms in an operating system [Zimmermann et al., 2002]. Its main objective is to detect the execution of an illegal operation carried out as a sequence of otherwise legitimate operations; such attacks are known as race conditions. Current operating systems offer access rights whereby each subject has his appropriate and precise rights. However, a subject may modify (more precisely, extend) his rights in an unpredictable manner by using vulnerabilities caused by existing security flaws. For these reasons, the authors suggest that enforcing the operating system's access control by introducing reference flows is of great interest to detect some new intrusions without establishing, a priori, a normal behavior as suggested by Denning [Denning, 1987] and Anderson [Anderson, 1980]. The definition of operation domains is proposed to match a given security policy, i.e., sets of operations that can be executed and combined in any way without harming the security policy of the target system. To implement this approach, a practical model using reference flow control is proposed. It is described as follows:

Definitions
• References: A reference corresponds to the ability to execute an elementary object method call in a specific operation domain. Each reference is tied to a reference bag rather than to a subject or an executing process. More formally, given an object o, a method m and a reference bag S, the reference R_S(o.m) is the ability to call the method m on the object o inside the domain represented by S.


A bag is modified as references are created (for example when a file descriptor is opened) or deleted (for example when a file descriptor is closed). We recall the notation introduced in [Zimmermann et al., 2002]: let Ω be an operation and S a reference bag such that Ω is legal in S; Ω^S denotes its execution allowed by references from S, and Ref(Ω^S) is the reference bag resulting from Ω^S.
• Reference flow: The attacker considered here aims at extending his rights to perform operations that have illegal consequences. Hence, if an operation Ω2 is a consequence [1] of another one Ω1 (noted Ω1 ⇒ Ω2), it should comply with the same security restrictions as Ω1. In practice, this corresponds to transmitting the reference bag of Ω1 to its consequence Ω2. Thus, processes executing Ω2 will be allowed to execute new operations as consequences of Ω1. The authors suggest defining Ref rules for the causal dependencies between consecutive and causally related operation executions, in order to enforce the security policy against attacks by delegation.

To illustrate this technique, let us consider a well known attack, as in [Zimmermann et al., 2002], that was first described by Matt Bishop in 1986. This attack exploits a vulnerability in the use of the lpr command in UNIX. It is no longer possible today, because the lpd program itself was modified to prevent abuse of its superuser privileges by adding device and inode number checking. The different steps of this attack are presented in Table 2.1.

Step  Action                                                        Operation
1.    Create a temporary file x                                     touch /home/malicious_user/x
2.    Wait until the printer queue is not empty, or block the
      printer (by printing many files or a big file, by using
      the disable command, or by disconnecting the printer)
3.    Queue the temporary file using a symbolic link                Ω1 :: lpr -s /home/malicious_user/x
4.    Remove the temporary file                                     Ω2 :: rm /home/malicious_user/x
5.    Create a symbolic link named x in the /home/malicious_user/
      directory to the password file /etc/shadow                    Ω3 :: ln -s /etc/shadow /home/malicious_user/x
6.    Reconnect the printer if it was disconnected in step 2
7.    Result                                                        Ω4 :: /etc/shadow is printed by the lpd daemon

Table 2.1: Printing the /etc/shadow file by a malicious user.

Considering the different steps of Table 2.1, no access control rule is violated. However, the malicious user has exploited a side effect of a legitimate system action, namely the symbolic link. These actions have let him/her print the /etc/shadow file, to which he/she has no read access right. Using the reference flow model presented above, the authors showed that the execution of Ω4 corresponds to a security policy violation. In the following, we illustrate their analysis of how to detect this violation using reference flow control.

[1] The consequences of any operation executed in some domain should be allowed to execute in the same domain.


To be executed, Ω4 should use a reference bag RS (for Read Shadow) that contains a read access to the file /etc/shadow. In addition, the malicious user is assumed to have no reference bag containing references that allow him/her to read or otherwise access the /etc/shadow file.

After the action of step 5 is executed, the operation Ω3 generates references to the symbolic link. By default, a symbolic link is considered by itself readable, writable and executable. Hence, ref3 = Ref(Ω3^{malicious_user}) = malicious_user ∪ symrefs, where symrefs corresponds to the symbolic references resulting from the execution of Ω3, which correspond in reality to the read, write and execute rights to the /home/malicious_user/x file in the initial malicious_user reference bag. In this fifth step, by creating the symbolic link, Ω3 writes to the /home/malicious_user/ directory.

The Ω4 operation has in reality two successive steps. The first step, Ω4.1, consists in reading the /home/malicious_user/ directory and then searching for the x file. Thus, Ω3 ⇒ Ω4.1, based on the following reference flow propagation: ref4.1 = Ref(Ω4.1^{ref3}) = ref3 ∪ {R_{malicious_user}(fd.read)}, where fd is a file descriptor in the /home/malicious_user/ directory. The second step of Ω4, denoted Ω4.2, consists in opening the target file. Thereafter, Ω4.1 ⇒ Ω4.2; consequently, Ω4.2^{ref4.1} should be executed. On one hand, the Ref rules proposed by the authors implement the operation semantics; on the other hand, the semantics of symbolic links in UNIX state that a read operation is permitted on a symbolic link if it is permitted on the file it points to. Therefore, opening /home/malicious_user/x for reading requires both R_S(/home/malicious_user/x.openread) and R_S(/etc/shadow.openread). This is not possible: Ω4.2 cannot be executed as an independent operation Ω4.2^{RS}, because the RS reference bag does not contain /home/malicious_user/x itself, and Ω4.2^{ref4.1} is not possible since R_S(/etc/shadow.openread) ∉ ref4.1. However, the second operation could be possible as Ω4.2^{ref4.1 ∪ RS}, which is by definition an intrusion symptom.

In the following, we show that domain and type enforcement, introduced during the eighties, had already answered the different objectives of the reference flow model [Zimmermann et al., 2002] presented above. We first give some background on DTE, then show that it addresses the problems the reference flow control model tries to solve.

Domain and Type Enforcement (DTE)
DTE is an access control mechanism that enforces the security policy of an information system by confining applications and restricting information flows. It is derived from the work of Boebert and Kain [Boebert & Kain, 1985] that restricts process accesses according to a site-specific security policy. DTE associates a domain with each subject (active entity), which corresponds to a running process, and a type with each object (passive entity), which corresponds to a file, a packet, or any other entity present in the considered operating system. DTE requires three tables:
1. An implicit typing table that assigns a type to each considered passive entity (object).
2. A DDT (Domain Definition Table): this table defines the authorized interactions between domains and types. Each row of the table corresponds to a predefined domain and each column to a predefined type. Each cell in this


table gives the different modes (e.g. read, write, execute, etc.) in which a process belonging to the considered domain (row) is authorized to act on the corresponding type (column). If the requested cell is empty, then the access attempt is denied.
3. A DIT (Domain Interaction Table): this table corresponds to subject-to-subject access control, relating domains to domains. If a subject A tries to access another subject B, then the domain of A selects the corresponding row in the DIT and the domain of B selects the corresponding column. As with the DDT, if the corresponding cell is empty, then the access attempt from A to B is denied.
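As a rough sketch of how DDT and DIT lookups gate access requests (the domains, types and access modes below are illustrative and do not come from a real DTE configuration):

# Missing keys model empty cells: the access attempt is denied.
DDT = {("lpd_d", "printer_t"): {"read", "write"},
       ("lpr_d", "printer_t"): {"write"}}
DIT = {("user_d", "lpd_d"): {"signal"}}

def ddt_allows(domain, obj_type, mode):
    """May a process of `domain` act on an object of `obj_type` in `mode`?"""
    return mode in DDT.get((domain, obj_type), set())

def dit_allows(src_domain, dst_domain, mode):
    """May a subject in src_domain act on a subject in dst_domain?"""
    return mode in DIT.get((src_domain, dst_domain), set())

print(ddt_allows("lpr_d", "printer_t", "write"))  # True
print(ddt_allows("lpr_d", "printer_t", "read"))   # False: empty cell, denied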

One feature that differentiates DTE from prior access control mechanisms [Bell & LaPadula, 1976, Biba, 1975] is its policy specification language, DTEL (DTE Language), which is flexible and easy to use. To specify and express reusable DTE configurations, DTEL offers four principal constructs:
1. A type statement that declares the different types used in the configuration of the considered DTE policy.
2. A domain statement that defines the access modes (e.g. read, write, execute, etc.) a process running in the considered domain is authorized to use when interacting with objects of specific types or with processes in other domains. An entry point, which must be invoked to enter the domain, is defined for each domain. This mechanism restricts access to a domain by ensuring that only programs believed to behave appropriately can be executed as its entry points. In fact, a process running in a domain A can transition to a domain B only by executing one of B's entry points. There are three modes that a process may use in order to transition from one domain to another:
• Explicit domain transition: an explicit request is possible only if the initial domain has an exec() access transition mode to the target domain.
• Automatic domain transition: a process may automatically transition from domain A to domain B if its domain contains an auto access mode to the target domain. This mechanism allows domain transitions to be superimposed on existing daemons and programs without altering the existing program code.
• Default domain transition: this is specified by the null mode, which corresponds to the null transition where the process continues its execution in the same domain.
3. An initial domain statement that specifies the domain in which the system's initial process (most of the time /etc/init) must start its execution.
4. An assign statement that assigns data types to specific files and directories. DTEL offers two options: one forces type assignments to be static at run time, and the other recursively assigns a type to all paths having the same prefix.

A simple DTE specification policy example
To illustrate how DTE and DTEL work, let us consider a simplified security policy corresponding to the lpd printing service in a UNIX-like environment. The following rules correspond to the specification of such a security policy:
• Only the lpd daemon can read and write to the printer (/dev/printer).


• Users may use the printer through the lpr, lpq and lprm commands [2].
The above simplified printing security policy may be translated into a DTE specification as follows:
• lpd_d is the domain of the printer daemon.
• lpd_exec is the entry point executable of the lpd_d domain.
• the printer (/dev/printer) and the printing queue (/var/spool/lpd) are assigned the printer_t type.
• the lpr domain lpr_d [3] is the domain used for the lpr, lpq and lprm printing commands. This domain can only create spool files (of printer_t type).
• lpr_exec is the entry point executable of the lpr_d domain.
Since the lpr command can be used to create a symbolic link to the desired file rather than copying it into the spool directory, the lpd_d domain will need to be granted read access to a variety of file types. Figure 2.1 shows a snapshot of the considered printer security policy.

...
type ..., printer_t, ...;
domain lpd_d = (file_prefix/lpd_exec), (rw->printer_t), ...;
domain lpr_d = (file_prefix/lpr_exec), (w->printer_t), ...
initial_domain = ...;
...
assign -r -s printer_t /var/spool/lpd, /dev/printer;
...

Figure 2.1: The simple printer security policy specification with DTEL.

Discussions
As mentioned above, the ideas of the reference bags approach presented in [Zimmermann et al., 2002] are similar to the domains of DTE. In [Zimmermann, 2003] (pages 47-48), Zimmermann argued that the DTE approach cannot prevent the printing of /etc/shadow by a malicious user, because DTE does not take the symlink into account in the DDT table. This claim is actually questionable, since the link access mode can be handled like the read or write access modes. As an application of DTE, Flask [Smalley & Fraser, 2001], a mandatory access control architecture, is being integrated into the Linux kernel, with DTE as one of the components of this architecture. In fact, some

[2] For simplification reasons, we only consider the lpr command.
[3] This is a simplification; there are actually two different lpr domains: user_lpr_d and user_sysadm_d.


macros are integrated into the architecture to extend the existing access modes (read, write, execute, etc.). One of these macros is the link_file_perms macro, which expands to permissions for linking, unlinking and renaming a file. Using this macro resolves the lpr problem of the attack considered in Table 2.1, since it is sufficient not to grant the link access mode for the passwd_t [4] type in the DDT of the users' domain. By not assigning the link access mode to the passwd_t type in the DDT of the user domain, step 5 will be denied, since it is not specified in the corresponding cell (i.e. l ∉ DDT[user_d, passwd_t]), and the lpr attack will not succeed.

The policy based approach is an anomaly based technique that has one more advantage over the other anomaly based techniques: it does not require a priori knowledge of normal behavior. An intrusion, or an intrusion symptom, consists in a security policy violation. This technique may detect novel attacks that are not known a priori, and it enforces the security policy of a computer system. However, it can only be used when the security policy can be explicitly specified, which is not always the case.

[4] The passwd_t type corresponds to the type assigned to /etc/shadow.

2.2.3 Immune systems

The immune technique was first introduced by Forrest et al. [Forrest et al., 1996] in 1996. The idea behind this method consists in developing techniques, similar to the body's immune system, that can distinguish what is normal (corresponding to the cells of the body, called "self") from foreign cells (called "nonself"). This approach differs from that of Denning [Denning, 1987] in that, instead of considering users' profiles, it characterizes normal system behavior. Short sequences of system calls, generated by different applications, are used to characterize the normal usage of the system. The algorithm used in this approach tries to build a profile of normal behavior for a program of interest and then treats any deviation from this profile as an anomaly. In the profile building process, called the learning step, a database of these normal sequences is generated. This database is used in the second stage, the test phase, to monitor process behavior and detect any deviation from the learned normal behavior. A normal application behavior is defined by a set of short system call sequences; the sequence lengths used in their experiments [Forrest et al., 1996] are 5, 6 and 7 system calls. To build the database, a window of size k + 1 is slid across the trace of system calls, and the different sequences encountered during this sliding phase are recorded. To illustrate this method, let us consider the following system call sequence as a normal execution of a certain process:
< open, read, mmap, mmap, open, getrlimit, mmap, close >
or
< O, R, M, M, O, G, M, C >
where "O" stands for "open", "R" for "read", etc. If we consider k = 3, then Table 2.2, which presents the corresponding database of our example, records the normal behavior of the considered process. After building the database from the normal execution of the considered processes, a new trace is tested with the same method, by sliding a window of size k + 1 across it. The test stage then consists in determining whether the sequence of system calls differs from those recorded in the normal database during the learning step.


system call   Position 1    Position 2   Position 3
O             {R, G}        {M}          {M, C}
R             {M}           {M}          {O}
M             {M, O, C}     {O, G}       {G, M}
G             {M}           {C}          {}
C             {}            {}           {}

Table 2.2: The database for the sequence < O, R, M, M, O, G, M, C >.

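To make the construction concrete, here is a minimal Python sketch (an illustrative reimplementation, not the code of [Forrest et al., 1996]) that builds the lookahead database of Table 2.2 and counts the mismatches of a new trace:

from collections import defaultdict

def build_database(trace, k):
    """For each call, record the set of calls seen at lookahead positions 1..k."""
    db = defaultdict(lambda: [set() for _ in range(k)])
    for i, call in enumerate(trace):
        for pos in range(1, k + 1):
            if i + pos < len(trace):
                db[call][pos - 1].add(trace[i + pos])
    return db

def count_mismatches(trace, db, k):
    """Count (call, follower at position pos) pairs absent from the database."""
    return sum(1
               for i, call in enumerate(trace)
               for pos in range(1, k + 1)
               if i + pos < len(trace)
               and trace[i + pos] not in db[call][pos - 1])

normal = list("ORMMOGMC")            # the normal trace of Table 2.2
db = build_database(normal, k=3)
test = list("ORMOOGMC")              # open instead of mmap as the fourth call
m = count_mismatches(test, db, k=3)
print(m, m / 18)                     # 4 mismatches out of 18, i.e. ~22%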
To illustrate the test phase, let us consider, using the database constructed above, a new system call trace that differs from the normal one at one location (open instead of mmap as the fourth call in the sequence): < O, R, M, O, O, G, M, C >. This new trace generates 4 mismatches, since:
• open is not followed by open at position 3,
• read is not followed by open at position 2,
• open is not followed by open at position 1, and
• open is not followed by getrlimit at position 2.
The evaluation of the method's result depends on the percentage of the total possible number of mismatches. The maximum number of mismatches that can occur for a sequence of length L, using a sliding window of size k + 1, is: (L − 1) + (L − 2) + ... + (L − k) = k(L − (1 + k)/2). For the above example, where k = 3 and L = 8, we obtain a maximum of 18 mismatches, and the mismatch percentage is 4/18 ≈ 22%.

Discussions and other related techniques
The proposed method consists in establishing a process's normal behavior under real conditions (environment, configuration, etc.) during a certain period. However, this approach brings some drawbacks:
• the false positive rate may increase, because not all normal process behaviors are recorded during the learning stage; indeed, any change of configuration or environment would generate many false alerts, even though the authors carried out their experiments in different configurations and situations;
• the false negative rate may increase if the monitored program was hacked during the recording stage;
• the time consumed during the learning stage is very high in order to come up with good input data.
In [Debar et al., 1998], Debar et al. considered another technique based on variable length patterns, where the decision to raise an alert is not based on the Hamming distance as in [Forrest et al., 1996]. They developed two techniques: one based on fixed length patterns and the other based on variable length patterns. They conducted their experiments on the


ftp server. Instead of using system calls, they used C2 BSM events [C2BSM, 1991]. The idea of using variable length patterns is motivated by the fact that, with fixed length patterns, there are very long subsequences that repeat themselves frequently. The method is defined as a new family of methods based on a suffix tree representation. The testbed of their experiments is not available; however, the proposed method minimizes the false negative alarm rate. On the other hand, Eskin et al. [Eskin et al., 2001] presented another method whose idea is similar to variable length pattern matching. The idea consists in using a dynamic window size instead of a fixed size as in [Forrest et al., 1996]. The authors introduced two different methods to estimate the optimal window size based on the data available in the learning database. The first is an entropy modeling method that determines the optimal single window size for the data by measuring the regularity of the data. The second is a probability modeling method that takes into account context dependent window sizes. The authors presented a new method for modeling system call traces using sparse Markov transducers, which takes advantage of the context dependency of the optimal window size: it estimates the best window size depending on the specific system calls in the subsequence, based on their performance over the training data. According to the experimental results in [Eskin et al., 2001], this method outperforms traditional methods in modeling system call data.

System calls and audit events are widely used to model the normal behavior of processes, using pattern matching or Hamming distance techniques with fixed or variable length matching. These anomaly detection methods can only be applied to processes and to the traces resulting from their normal execution. Applying them to other computer activities, such as network traffic, which is the main source of attacks coming from outside or inside the monitored environment, is not an easy task and should be investigated.

2.3 Misuse based

Misuse intrusion detection refers to intrusions that follow well defined attack patterns exploiting weaknesses in system and application software. Such patterns can be precisely written in advance. Therefore, from this prior knowledge about bad or unacceptable behavior, this technique seeks to detect it directly, as opposed to anomaly intrusion detection, which seeks to detect the complement of normal behavior. Most misuse detection techniques consist in first recording and representing the specific patterns of intrusions that exploit known system vulnerabilities or violate system security policies, then monitoring current activities for such patterns, and reporting the matches. Several misuse detection algorithms were introduced during the last two decades. They differ in their representation as well as in the matching techniques employed to detect intrusion patterns. Most misuse detection techniques are signature based: they consist in detecting the existence of substrings, which correspond to an attack signature, in a string that represents the sequence of events caught during the monitoring period. The scenario based techniques introduced in the 90's constitute another advance in the field of intrusion detection, and in particular in misuse detection. In the following, we present the misuse approaches developed recently. We first give an overview of the signature based techniques used by most of the widely deployed open source IDSs, such as Snort [Roesch, 1999]. We then present some approaches used to detect attack scenarios, known in the UNIX operating system, using modeling techniques such as state transition analysis and colored Petri nets.

2.3.1 Signature based

Most network based intrusion detection systems rely on exact pattern matching techniques. Depending on the choice of the pattern matching algorithm, its implementation and configuration, this pattern matching may become a performance bottleneck. In the following, we discuss the network intrusion detection system (NIDS) Snort [Roesch, 1999], because it is probably the largest and most popular open source NIDS available today. It allows users to monitor their networks for signs of hostile activity, as well as to perform a set of other tasks such as generic packet sniffing or forensic analysis of network attack traffic.

Snort pattern matching technique
Snort uses its own data structure to match the rules defined a priori against the monitored traffic. It creates a two-dimensional linked list structure that consists of Rule Tree Nodes (RTNs) and Option Tree Nodes (OTNs) [Roesch, 1999]. The two-dimensional linked list (see Figure 2.2) is created according to the different signatures defined a priori in the signature database. The RTNs contain features that are shared by many rules, such as source and destination IP addresses, source and destination port numbers, and protocol type (IP, TCP, ICMP, UDP, etc.). The OTNs contain the different options that a rule may hold, such as TCP flags, packet payload size, etc., and in particular the most time consuming one: the payload content.

Figure 2.2: An OTN and RTN generic linked list structure.

During the matching stage, packets are examined against the signature database as follows. Each packet is first compared to the RTN list, from left to right, until it matches a particular RTN. Once an RTN is matched, the options present in its corresponding OTNs are compared with those of the current packet until a match is found.


The options present in each OTN are checked one by one. If any option fails, the packet is checked against the next OTN of the current list. The content option is the last option considered, since it is the most time intensive option to check. If a content check is present in an OTN, then old versions of Snort (up to version 1.9) use the Boyer-Moore [Sedgewick, 1997] pattern matching algorithm to check the presence of the content string in the current packet payload. If no match exists, Snort proceeds to the next OTN in the list. However, different OTNs may have identical parts in their content options. As an example, assume that one OTN has the content "graphics/ccc.gif" and the next OTN contains the string "graphics/aaa.gif" as its content option. If scanning the whole packet does not detect any occurrence of the string "graphics", then it is unnecessary to search for it again with the second OTN's content.

Boyer-Moore is a string matching algorithm that is fast in practice and uses two heuristics. The first is commonly referred to as the bad character heuristic: if the examined text character does not occur in the keyword being searched for, then the keyword is shifted forward N characters, where N is the length of the keyword. The second heuristic uses knowledge of repeated substrings in the keyword: if a mismatch occurs and repeated patterns exist in the keyword, it shifts the keyword to the next occurrence of a substring that matches what has already been successfully matched.

The Boyer-Moore technique was designed for exact string matching, where one keyword is compared against many strings. Many rules in the Snort signature database have contents sharing the same prefixes. Therefore, enhancing the Boyer-Moore technique to search for a common prefix in many OTNs at the same time speeds up matching. In fact, Coit et al. [Coit et al., 2001] described and implemented a Boyer-Moore like algorithm applied to a set of keywords held in an Aho-Corasick like keyword tree that overlays the prefixes of the keywords. Their algorithm is named AC_BM, for Aho-Corasick Boyer-Moore, since it examines packets from right to left and uses a common prefix approach, where the various rules requiring content searches are placed in a tree that can be searched using elements of the Boyer-Moore technique. With this new implementation, the authors obtained results 4 times faster than the original Snort engine (up to version 1.9) when all the considered rules contain a content option. At about the same time, Fisk et al. [Fisk & Varghese, 2001] developed a hybrid string matching approach that uses three different search algorithms: the first is an improvement of the Boyer-Moore technique, the second is Boyer-Moore-Horspool [Horspool, 1980] and the third is the Aho-Corasick searching algorithm. Their implementation matches many common packets 5 times faster, with an average speedup of 50%. The algorithms from the Los Alamos National Laboratory [Fisk & Varghese, 2001] and from Silicon Defense [Coit et al., 2001] exposed above are integrated in the current version of Snort [Snort NIDS, 2005]. On the other hand, Abbes et al. [Abbes et al., 2004] introduced an interesting technique that combines pattern matching with protocol analysis for a fast and effective signature based IDS. They showed that their approach detects many attacks that Snort is unable to detect.
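As a reminder of how the bad character heuristic proceeds (a textbook sketch, not Snort's actual implementation; the packet payload below is made up):

def bm_search(text, pattern):
    """Boyer-Moore search using only the bad-character heuristic: compare
    the pattern right to left and, on a mismatch, shift it so that the
    offending text character aligns with its rightmost occurrence in the
    pattern, or past it if it does not occur at all."""
    last = {c: i for i, c in enumerate(pattern)}  # rightmost index of each char
    m, n = len(pattern), len(text)
    i = 0
    while i <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[i + j]:
            j -= 1
        if j < 0:
            return i                              # match found at offset i
        i += max(1, j - last.get(text[i + j], -1))
    return -1

print(bm_search("GET /graphics/ccc.gif HTTP/1.0", "ccc.gif"))  # 14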
Of course, these algorithms speed up the search through the monitored traffic. However, as we explained in the introduction, the real problem lies in the novelty of attacks: many hackers develop new attacks rather than reusing known ones, nearly all of which already have a corresponding Snort signature.

2.3.2 Scenario based

During the 90’s, several vulnerabilities were found in different UNIX like operating systems such as SunOS Solaris, Free BSD etc. These attacks were exploited using some sequences of UNIX commands in order to gain super user privileges. One such scenario consists in gaining the root privilege by exploiting a vulnerability in the mail utility [Ilgun, 1993]. One attack script for exploiting this scenario is the following, described by Koral Ilgun [Ilgun, 1992]: cp /bin/sh /usr/spool/mail/root chmod 4755 /usr/spool/mail/root touch x mail root < x Therefore, many researchers have investigated this domain and developed scenario based techniques in order to detect these attacks which combine several steps in order to reach their goal which mainly consists in gaining super user privileges. In the following, we first focus our discussion only on some of these works such as the USTAT [Ilgun, 1993] from the CERIAS team at the university of Santa Barbara, the second is that system using colored Petri nets developed in Perdue University [Kumar & Spafford, 1994]. Finally, we discuss another system called GASSATA [M´e, 1994] which used genetic algorithms. Since we have investigated this last technique, a thorough presentation of this model and a particular improvement of it is discussed. USTAT—UNIX State Transition Analysis Technique U ST AT [Ilgun, 1993] is a mature prototype implementation of the state transition approach to intrusion detection. State transition analysis takes the view that the computer is initially in some safe state, and after a sequence of actions performed by the attacker leads the target system to a compromised state. U ST AT reads specifications of the state transitions necessary to complete an intrusion, supplied by the SSO (Site Security Officer), and then evaluates an audit trail with respect to these specifications. A penetration example Table 2.3 presents a scenario example for 4.2 BSD UNIX that can be used illegally with which an attacker gains administrative privileges.

A penetration scenario to gain root access
Step  Command                               Comment
1.    %cp /bin/csh /usr/spool/mail/root     assumes no root mail file
2.    %chmod 4755 /usr/spool/mail/root      make setuid file
3.    %touch x                              create empty file
4.    %mail root < x                        mail root the empty file
5.    %/usr/spool/mail/root                 execute setuid-to-root shell
6.    root%                                 root shell obtained

Table 2.3: Penetration scenario (from [Ilgun, 1993]).

In this scenario, the attacker exploits a flaw in the mail utility: when mail appends a message and changes the owner of the mail file, it fails to reset the setuid bit set by the current real owner (i.e. the attacker). The attacker has thus gained administrative (root/superuser) privileges. To model this penetration as a sequence of state transitions, one must first define the initial requirement state and the compromised (goal) state.

[Figure 2.3 shows the state transition diagram: from the initial requirement state SR (1. exists(object) = false; 2. attacker ≠ root), the action "attacker creates(object)" leads to SC-2 (1. owner(object) = attacker; 2. setuid(object) = disabled), "attacker mod_setuid(object)" leads to SC-1 (1. owner(object) = attacker; 2. setuid(object) = enabled), and "attacker mod_owneruid(object)" leads to the compromised state SC (1. owner(object) ≠ attacker; 2. setuid(object) = enabled).]

Figure 2.3: State transition diagram using USTAT.

To execute the first step, root's mail file must not exist, or must be writable. As we progress through the steps of the example, we find that the intruder must have write access to the mail directory, and must also have execute access to cp, chmod, touch and mail. Note that the nature of the penetration in this case is not the execution of the setuid shell per se: even if the intruder chose not to execute the command interpreter, there would still be a violation, in that there now exists an executable setuid-to-root file on the system that the superuser did not create. The intrusion described above leads to the state transition diagram of Figure 2.3. The different steps of this penetration scenario are represented on the edges leading from the initial requirement state SR to the target compromised state SC. To each step corresponds a set of assertions. For example, the two assertions below state SC-2 mention that, at this point, the considered object (/usr/spool/mail/root) is owned by the attacker and the file's setuid bit is disabled: they represent the state of the system just after the file is created (the action on the edge between SR and SC-2) and before the setuid bit is changed. Note how the intrusion scenario has been stripped of many assumptions about the nature of the intrusion, e.g. the fact that the file /usr/spool/mail/root is a copy of the command interpreter csh; this information is not necessary to detect the violation. The first step, creating /usr/spool/mail/root, is paramount to detecting this intrusion; however, how this file is created, or what it contains, is not of vital importance. Thus, the state transition diagram abstracts away from the specifics of the intrusion in such a way that it represents variations of the same intrusion scenario, which a more straightforward, simplistic, signature based intrusion detection system might fail to detect.
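The flavor of the detection procedure can be sketched by replaying abstract audited actions over the diagram of Figure 2.3 (a deliberately simplified illustration: the real USTAT also evaluates the assertions attached to each state and binds them to actual objects):

# Transitions of Figure 2.3; SC is the compromised state.
TRANSITIONS = [
    ("SR",   "creates",      "SC-2"),
    ("SC-2", "mod_setuid",   "SC-1"),
    ("SC-1", "mod_owneruid", "SC"),
]

def replay(audit_actions):
    """Return True if the audited actions drive the diagram from SR to SC."""
    state = "SR"
    for action in audit_actions:
        for src, act, dst in TRANSITIONS:
            if state == src and action == act:
                state = dst
                break
    return state == "SC"

# The penetration of Table 2.3 viewed as abstract audited actions:
print(replay(["creates", "mod_setuid", "mod_owneruid"]))  # True: compromised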


The authors [Ilgun, 1993] evaluated the prototype with two kinds of tests, functional as well as performance tests. The prototype was confronted with a number of possible intrusion scenarios, and variations thereof, where the attacks were performed by several attackers in unison, using hard links to files instead of the original file names, etc. These tests demonstrated that USTAT indeed managed to detect intrusions under these circumstances. The prototype was run on a single workstation that also performed the audit collection. These tests indicated that, under light load, USTAT kept up well with the stream of audit records, but when audit intensive applications such as the find command were run, USTAT did not perform as well. USTAT consumed approximately 13% of the CPU, and the bottleneck was identified as being the disk to which the audit facility stored audit records and from which USTAT attempted to read those same records. We should mention that, at the beginning, the goal of this system was the representation of attack scenarios. One problem mentioned by the authors is that the representation of actions is only sequential: in fact, for the signature described in Figure 2.3, a parallel representation is more adequate. This was realized using Colored Petri Nets [Jensen, 1997] in the thesis of Kumar [Kumar, 1995], whose system is described in the following paragraph.

IDIOT—Intrusion detection using Petri nets
IDIOT ("Intrusion Detection In Our Time") [Kumar & Spafford, 1994, Kumar, 1995] is a system developed in the COAST project at Purdue University, IN, USA. Kumar [Kumar & Spafford, 1994, Kumar, 1995] employed Petri nets for signature based intrusion detection: each intrusion signature is represented as a colored Petri net. Figure 2.4 describes the attack presented above (obtaining a superuser shell) using a colored Petri net.

Command                              System call
%cp /bin/sh /usr/spool/mail/root     write
%chmod 4755 /usr/spool/mail/root     chmod
%touch x                             stat; utime
%mail root < x                       exec

Table 2.4: Correspondence between commands and system calls.

The horizontal branch corresponds to the following two commands:
%cp /bin/sh /usr/spool/mail/root
%chmod 4755 /usr/spool/mail/root
while the diagonal branch represents the following command:
%touch x

Each scenario described by a Petri net may have more than one initial place and only one final place. The evaluation begins by placing a token in each start state; these tokens then "flow" through the Petri net. Transitions may be guarded by Boolean expressions that must evaluate to true for the transition to take place. A system call is associated with each transition of the net (Table 2.4 gives the correspondence between the scenario commands and the system calls).
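The token-flow evaluation can be sketched as follows (our much simplified illustration of the idea: unlike IDIOT's colored Petri nets, tokens here are neither typed nor consumed, and guards inspect only the current event):

class Net:
    """A transition fires when all its input places are marked and its
    guard holds for the current event; marking the final place signals
    that the intrusion scenario has been recognized."""
    def __init__(self, transitions, final):
        self.transitions = transitions    # (input places, guard, output place)
        self.final = final

    def run(self, events, initial):
        marked = set(initial)
        for ev in events:                 # ev: (system call, attributes)
            for inputs, guard, output in self.transitions:
                if inputs <= marked and guard(ev):
                    marked.add(output)
        return self.final in marked

MAIL = "/usr/spool/mail/root"
net = Net(
    transitions=[
        ({"s1"}, lambda e: e[0] == "write" and e[1].get("path") == MAIL, "s2"),
        ({"s2"}, lambda e: e[0] == "chmod" and e[1].get("mode") == 0o4755, "s3"),
        ({"s0"}, lambda e: e[0] == "stat", "s4"),       # the touch x branch
        ({"s3", "s4"}, lambda e: e[0] == "exec", "sf"), # the branches join
    ],
    final="sf",
)
events = [("write", {"path": MAIL}), ("chmod", {"mode": 0o4755}),
          ("stat", {}), ("exec", {})]
print(net.run(events, initial={"s0", "s1"}))  # True: scenario recognized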

[Figure 2.4: colored Petri net representation of the intrusion scenario, with places (e.g. S4, S5) and guarded transitions (e.g. t4, with guard F = true_name(this[OBJ])) labeled with the system calls of Table 2.4.]

GASSATA—Security audit trail analysis using genetic algorithms
In GASSATA, Mé [Mé, 1994] formulates the detection of known attack scenarios in an audit trail as an optimization problem. Let Ne be the number of auditable event types and Na the number of known attack scenarios. Let AE be an Ne × Na attack-events matrix giving the set of events generated by each attack, where AEij is the number of events of type i generated by the attack scenario j. Let R be an Na-dimensional weight vector, where Ri (Ri > 0) is the weight associated to attack i (Ri is proportional to the risk of intrusion whenever the scenario i is present). Let O (for Observed) be an Ne-dimensional vector corresponding to the current audit trail. Let H (for Hypothesis) be a binary Na-dimensional vector, where Hi = 1 if attack i is assumed to be present and Hi = 0 otherwise (H describes a particular subset of attacks). The simplified security audit trail problem consists in determining the hypothesis vector H which maximizes the product R × H subject to the constraint (AE × H)i ≤ Oi (1 ≤ i ≤ Ne) (Equation (2.6) from [Mé, 1994]):

    Maximize  R × H = Σ_{i=1..Na} Ri Hi
    subject to  (AE × H)i = Σ_{j=1..Na} AEij Hj ≤ Oi,  1 ≤ i ≤ Ne        (2.6)

This problem consists in finding the hypothesis vector H that maximizes the risk of intrusion according to the observed audit trail (i.e. finding H such that the corresponding risk is the greatest according to the observed audit events O). In [Mé, 1994], the author demonstrated that this problem (Equation (2.6)) is NP-complete, and therefore used a heuristic approach based on genetic algorithms [Goldberg, 1991] to solve it. In [Bouzida & Gombault, 2003a], we improved this method by simplifying the optimization problem of Equation (2.6) as follows. The parameters used in our formalization are:
• Let Ne be the number of auditable system events and Ns the number of potential known attack scenarios,
• Let AES be an Ne × Ns Attack Events Scenario sparse matrix which gives the set of events generated by each attack scenario. AESij is the number of auditable events of


type i generated by the attack scenario j (AESij ≥ 0) (see Figure (2.5) for an example of such a matrix). AESi is the ith column vector of AES, which represents the ith attack scenario,
• Let Ob be an Ne-dimensional vector where Obi counts the number of events of type i present in the audit trail (Ob is called the "observed audit vector").

[Figure 2.5 shows an example of the AES sparse matrix for four attack scenarios: for each scenario, the non-zero entries are stored as (line number, number of events) pairs.]

Figure 2.5: Event scenario sparse matrix example.

Let I be an Ns-dimensional positive integer vector, where Ii is the number of occurrences of the ith attack scenario (I describes all the attacks present in the audited file). To capture the manifestation of one or more attacks contained in the audit trail (i.e. Ob), we have to find the vector I which maximizes the sum Σ_{i=1..Ns} Ii (find I that maximizes the risk), subject to the constraint Ii × AESi ≤ Ob (1 ≤ i ≤ Ns) (see Equation (2.7)):

    Max  Σ_{i=1..Ns} Ii
    s.t.  Ii × AESi ≤ Ob,  i = 1, ..., Ns        (2.7)

where AESi is the ith column of AES, which represents the ith attack scenario of the sparse matrix AES, and I is an Ns-dimensional positive integer vector. It is clear that system (2.7) is a polynomial problem of complexity O(Ne × Ns), and its resolution is very simple:

    Ii =  min  ⌊ Obj / AESji ⌋,  over j = 1, ..., Ne such that AESji ≠ 0        (2.8)
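A direct implementation of Equation (2.8) is immediate; the following sketch (illustrative code, not the implementation evaluated in Figure (2.6)) reproduces the numerical example given below:

def detect_attacks(aes, ob):
    """Equation (2.8): for each attack scenario (column i of AES), Ii is the
    minimum of floor(Obj / AESji) over the non-zero entries of the column."""
    ne, ns = len(aes), len(aes[0])
    return [min(ob[j] // aes[j][i] for j in range(ne) if aes[j][i] != 0)
            for i in range(ns)]

AES = [[5, 0, 0, 0],
       [3, 6, 2, 0],
       [1, 2, 4, 0],
       [2, 1, 0, 8],
       [2, 0, 0, 0]]
Ob = [10, 8, 5, 9, 4]
print(detect_attacks(AES, Ob))   # [2, 1, 1, 1], i.e. the vector Imax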


To illustrate our model, let us consider the following simple example:

          | 5 0 0 0 |              | 10 |
          | 3 6 2 0 |              |  8 |
    AES = | 1 2 4 0 |    and  Ob = |  5 |
          | 2 1 0 8 |              |  9 |
          | 2 0 0 0 |              |  4 |

The solution to this problem (using Equation (2.8)) is

    Imax = (2, 1, 1, 1)^T

Some shortcomings of the method described in [Mé, 1994], which do not appear in our model, are:
• by using a binary coding for the individuals, it cannot detect multiple realizations of a particular attack,
• if the same event or group of events occurs in several attack scenarios, and a malicious intruder realizes these attacks simultaneously without duplicating this event or group of events, it fails to find the actual attacks,
• by using genetic algorithms, if there is more than one optimal solution, it randomly provides one of them, and
• if an intruder knows the period of a session, he can perform an attack spanning two or more sessions, and it will also fail to detect this attack ([Mé, 1994] proposed to execute GASSATA whenever possible, every 18 seconds for example, considering the whole audit trail from the beginning of the user's session).
To illustrate the first three drawbacks, let us apply the algorithm to the above example; the optimal H vector using Equation (2.6) will be one of the following:

    H1 = (0, 0, 1, 1)^T,  H2 = (0, 1, 0, 1)^T,  or  H3 = (1, 0, 1, 0)^T

In addition to these drawbacks, if the number of attack scenarios is large enough, the execution time using genetic algorithms becomes very large, and the solution given to the fourth disadvantage is not realistic. Figure (2.6) shows the execution time in seconds versus the number of attacks present in the attack-events matrix: a) when using the heuristic method of [Mé, 1994], and b) when using our model. This comparison was not performed on a real network with real users; it is only a simulation of some users' behavior after introducing some known attacks into the audited behavior. The experiments showed that our model finds the solutions in real time, presents fewer false negatives and finds exactly all the attacks added to the audit trails. Unlike the different scenario models cited above, which take into account the order and the temporal constraints of the audited events, GASSATA and our model do not take these constraints into account.

[Figure 2.6: Execution time with GASSATA and our model; the plot gives the execution time in seconds versus the number of attacks for a) GASSATA and b) our model.]

Of course, this does not refine the description of attack scenarios. However, our model and its detection algorithm are efficient compared to GASSATA and may be used in real time, since the corresponding execution time is low. This new formalism could be used to detect attacks combined by more than one user. In addition, it may be used as a pre-model for a presumption of attack presence in the audit files.

2.4 Cooperative approaches

At the end of 90’s, many commercial and open source intrusion detection systems have been developed and made freely available. Their deployment in operational infrastructure has shown their limitations. The main problem consists in the alerts excess produced by these IDSs where the security officer could not investigate all the huge amounts of alerts neither could he cope with the load of these alerts. Another problem consists in the scantiness of alerts that could not be easily diagnosed by the operator even if they originate from signature or misuse sensors that, by definition, produce more synthetic and meaningful alerts. Therefore, the operator should investigate, manually, the different events that have caused the considered alerts. He should also face the denial of service that, actually, is a result of the different sensors that may be results of some tools that are intentionally developed in order to flood IDSs with packets that generate false alerts. Since multiple IDSs are being used to detect as many alerts as possible all over the network infrastructure, the operator is subverted by many alerts originating from many sensors that do not provide techniques that may help operators to group those alerts that share the same context features.


In addition to the above problems, many IDSs generate huge numbers of alerts, which may be false positives or true positives. Coping with all of these alerts is not an easy task for the security officer. Therefore, many researchers have investigated the field and introduced different ideas to reduce the number of alerts and make existing IDS tools cooperate, in order to provide only valuable information to the SSO. In reality, the intrusion detection community borrowed the idea of managing alerts from network surveillance, where alert correlation was first used to reduce the operator's workload [Jakobson & Weissman, 1993]. Telecommunication networks produce thousands of alarms per day, making real-time network surveillance and fault management difficult. Based on this observation and the actual need for real time alert assessment, Jakobson and Weissman [Jakobson & Weissman, 1993] introduced alarm correlation to cope with this phenomenon; they define alarm correlation as a conceptual interpretation of multiple alarms such that a new meaning is assigned to these alarms. The following paragraphs discuss the main advances of the intrusion detection field during the last decade in making different IDS tools cooperate using alert aggregation and alert correlation. We also investigate some cooperative architectures developed in this direction.

2.4.1 Alert aggregation and fusion

Following the results and implementations of the alert aggregation and correlation work done by Cuppens in the Mirador project (a DGA funded project from 1999 to 2001) [Cuppens & Miège, 2002, Cuppens, 2001b], we differentiate the notion of alert aggregation from alert correlation. Alert aggregation and fusion focus on how to find similarities between different alerts generated by different sensors. These sensors are geographically distributed over the Internet or locally implemented in distinct places of a local area network. In fact, many alerts resulting from the same event may be received from different kinds of sensors, and a single IDS may also generate several different alerts corresponding to the same event. The process of grouping such alerts and generating a global alarm that represents the coalesced alerts is called alert aggregation or fusion. The correlation function, on the other hand, consists in analyzing the alert clusters provided as outputs by the merging function; its objective is to correlate alerts in order to provide the security officer with synthetic information. However, almost all papers in the literature do not differentiate between correlation and aggregation and use these words interchangeably.

There are many reasons that have led researchers to introduce and investigate the aggregation direction in intrusion detection. This is generally due to practical observations made after using heterogeneous IDSs in a network infrastructure. Let us illustrate this with the example of a network where many different IDSs are deployed separately. Some considerations should be taken on the basis of the following remarks:
• Many alerts may be generated for the same event, because either each of the IDSs detects it as an alert or the same IDS generates many different alerts for the same event. These alerts may have different classification names and different alert


contents, due to the names and fields assigned and filled in by the different IDSs deployed for this task. Therefore, grouping or aggregating these alerts into one global alert, which best describes the different alerts collected from one or several IDSs, is interesting for a better description of the alert, which may then be more easily analyzed by the operator or by the next steps of the detection procedure, for example in order to follow the attack plan activity.
• Let us consider a probing attack consisting of IP sweeping or port sweeping, whose generated packets may have destination addresses in the same target subnetwork and a unique source address from the attacker's subnetwork. In this situation, one alert grouping all the corresponding alerts is more adequate to specify and summarize the probe attack.
• An attacker who forges IP source addresses to flood one target IP address may generate many alerts. Here also, an alert grouping strategy is useful to reduce the number of alerts corresponding to the same flooding attack. The same strategy should also be applied to alerts originating from many different sources and targeting one victim.
• Alerts generated by different IDSs, whose classification names differ from one sensor to another, should be grouped in order to generate one alert, so that the operator may have a good vision of the event that happened. Hence, it is necessary to maintain a database of alert correspondences between IDSs.

From these points, alert aggregation should be performed just after receiving the different alerts generated by the different IDSs. Thereafter, based on the global alerts grouped during the aggregation phase, other techniques may be used, for example to follow the planning activity of an attacker trying to reach his goal. Many aggregation techniques have been investigated; we focus our discussion on some of them. Valdes and Skinner [Valdes & Skinner, 2001] proposed a probabilistic correlation approach, Debar and Wespi [Debar & Wespi, 2001] presented an aggregation and correlation (AC) algorithm, Julisch [Julisch, 2002] proposed a root cause analysis for alert aggregation, and Cuppens [Cuppens, 2001b] used similarity expert rules to merge alerts from different sensors.

Probabilistic approach
Valdes and Skinner [Valdes & Skinner, 2001] introduced a similarity approach that focuses on how to aggregate two alerts. A set of meta alerts, each composed of several alerts possibly coming from different host and network sensors, is maintained. A new alert is aggregated with a meta alert if they share similar feature values. These features correspond to some fields of an alert format, such as those defined in the IDMEF IETF standard [Debar et al., 2005]. Not all the fields characterizing an alert are taken into account since, for example, a network IDS does not provide a process ID, whereas this feature is usually present in the case of a host based one. For each considered feature, a similarity function, which returns a number between 0 and 1, is defined. An expectation of similarity, whose value is also between 0 and 1, is used to strengthen or weaken the matching of a meta alert with a new alert. For example, a syn-flood would be more likely to be aggregated with an earlier probe even if the source address does not match (that is, the expectation of a match on IP addresses is low). The overall similarity is computed as the expectation-weighted combination of the per-feature similarities, formalized in Equation (2.9) below.
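A minimal sketch of this expectation-weighted combination (the feature functions and weights are made up for illustration; this is not the authors' implementation, and local thresholds are omitted):

def overall_similarity(x, y, expectation, sim):
    """Expectation-weighted mean of per-feature similarities (Equation (2.9)).
    x, y: alerts as feature dicts; expectation: feature -> Ej;
    sim: feature -> similarity function returning a value in [0, 1]."""
    feats = [f for f in expectation if f in x and f in y]
    num = sum(expectation[f] * sim[f](x[f], y[f]) for f in feats)
    den = sum(expectation[f] for f in feats)
    return num / den if den else 0.0

sim = {"src_ip": lambda a, b: 1.0 if a == b else 0.0,
       "attack_class": lambda a, b: 1.0 if a == b else 0.5}
E = {"src_ip": 0.2, "attack_class": 1.0}  # low expectation of matching IPs
probe = {"src_ip": "10.0.0.9", "attack_class": "probe"}
flood = {"src_ip": "10.0.0.1", "attack_class": "probe"}
print(overall_similarity(probe, flood, E, sim))  # (0 + 1.0) / 1.2 ≈ 0.83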


Valdes and Skinner [Valdes & Skinner, 2001] used the following similarity function to compute the overall similarity between a new alert and a meta alert:

    SIM(X, Y) = ( Σ_j Ej SIM(Xj, Yj) ) / ( Σ_j Ej )        (2.9)

where X is the candidate meta alert for matching, Y is the considered new alert, Ej is the expectation for feature j, Xj and Yj are the values of feature j in alerts X and Y respectively (possibly lists of values), and SIM also denotes the similarity matrix maintained between attack classes, whose off-diagonal values heuristically express the similarity between attack classes. A local threshold is defined for each attribute, enforcing the minimum similarity between the two considered alerts for aggregation. Thus, if an attribute expectation value is less than its corresponding local threshold, the new alert cannot be considered as a candidate for aggregation and the overall similarity is considered null. A second threshold on the overall similarity is also used to decide whether to aggregate the two alerts using the above similarity equation. The main drawback of this method is the empirical definition of the similarity matrix: the heuristic technique with which this matrix is filled is biased, since it requires a priori knowledge about attack classes. It is not completely implicit, since a matrix whose values are fixed a priori between the different attacks has to be maintained. Of course, their technique is only based on known attacks. If we consider anomaly detection IDSs, which do not provide the attack class in the content of the generated alerts, the proposed method will obviously fail to aggregate a detected novel attack, since there is no entry in the similarity matrix for this new attack.

In [Autrel & Cuppens, 2005], Autrel and Cuppens presented an alert aggregation method based on a similarity function that returns a real value in [0, 1]. A similarity value of 0 between two alerts means that "the two considered alerts are not related to the same event", and a value of 1 means that "the two alerts have probably been generated upon the same event". In order to calculate the similarity between two events, one calculates the similarity values between the alerts' attributes and aggregates them to obtain a final similarity value. To each alert attribute corresponds a function that may be used to calculate the similarity between two values of this attribute. The authors also provide a default weight that may be used to calculate the aggregation value of the two alerts: in the case where the events associated with two alerts are not of the same type, a predefined function returns a default weight associated with the attribute type. Using this default value, one may aggregate alerts even if the corresponding alert type is unknown. This technique makes possible the aggregation of alerts generated by anomaly and misuse detection tools.

Aggregation and Correlation (AC) Algorithm
Debar and Wespi [Debar & Wespi, 2001] integrated into the commercial Tivoli Enterprise Console an aggregation and correlation (AC) algorithm whose goal is to form groups of alerts by creating a small number of relationships that are provided to the operator instead of the raw alerts. The AC algorithm uses two different kinds of relationships. The first is the correlation relationship, where related events are considered part of the same trend of attacks and are processed together.
The AC algorithm uses explicit rules programmed a priori, or derives them from configuration information, in order to realize this first relationship. This first relationship between events consists of duplicate and consequence alerts.


The duplicate relationship deals with alerts that are logically linked with each other, either backward-looking (i.e. a duplicate alert corresponds to an alert already processed by the AC algorithm) or forward-looking (i.e. the current alert must be followed by others, where consequence alerts are expected). Duplicate detection focuses on common attributes (source IP, source port, target IP, target port, and times that are close to each other). The AC algorithm uses a configuration where a duplicate definition consists of four associated terms:
• Initial alert class: the class of the alert that is received earlier.
• Duplicate alert class: the class of the current alert being evaluated as a possible duplicate of the first one.
• Attributes list: the attributes of the two alerts that should be equal for the two events to be considered duplicates.
• Severity level: a new severity level for the duplicate alert that the AC algorithm uses for further processing instead of the first one.
The second is the aggregation relationship. Since there are many cases where isolated events are not considered significant, the authors [Debar & Wespi, 2001] used three different axes, namely the source, the target and the class, to aggregate these isolated events. By considering all of these features, or only some of them, different situations occur; in fact, a wildcard may be used for each axis. Based on this, the aggregation is performed and different attacks are detected in different situations accordingly. The aggregation and correlation (AC) algorithm is used to group and correlate consequence alerts in order to reduce the number of alerts and to construct attack scenarios. The algorithm is empirical, since it compares alerts according to the different considered attributes in order to aggregate or correlate them. However, the introduction of wildcards in the aggregation relationship is of great interest, and could be used to aggregate alerts coming from anomaly detection systems.

Root causes
Julisch [Julisch, 2001] proposed a mining technique to find out the reasons why some alerts are triggered; these reasons are called root causes. He experimented with his method on a set of more than 150,000 alerts and showed that 90% of these alerts are due to bad configuration. He claims that removing the root causes of these alerts reduces the future alarm load by 82%. The technique is easy to implement: each alarm is described by its categorical, numerical, time and string attributes, and each of these attributes is represented by a tree (called a taxonomy). The idea consists in formalizing an optimization problem after collecting all alerts, described using the above taxonomies, into one set, during one month for example. The proposed clustering algorithm is then run over the collected set to find the different clusters of alerts and group them into sets; these new groups are then investigated in order to find out the reason for their presence. The objective of this technique is not to find new attack scenarios or to follow the activity of an attacker, but only to identify the principal reasons why thousands of alerts are triggered, so that those misconfigurations can be fixed.


The number of alerts due to equipment misconfiguration is much smaller when all the equipment in the monitored network is well configured; in such a situation, this technique would no longer be useful.

Similarity expert rules

Cuppens [Cuppens, 2001b] proposed a cooperative technique that takes into account heterogeneous IDSs installed in different places over the monitored network. He defined three principal steps to reach the aggregation objective:

• Alert management function: the messages generated by the different IDSs are stored and managed in a relational database.

• Alert clustering function: alerts corresponding to the same occurrence of an attack are recognized and gathered into the same cluster.

• Alert merging function: a global alert, corresponding to each cluster, is generated using a predefined merging function.

One should mention here that the aggregation technique suggested by Cuppens is a step that manages the different alerts. This step precedes the correlation task, which uses the alerts clustered during the aggregation step to follow the attacker's plan and construct the ongoing attack scenario in real time. The correlation step is discussed in the following section.

First, a relational database, implemented in Prolog, is used to convert and store the IDMEF alerts generated by the different IDSs. The merging function can then access this database and generate clusters from the stored alarms. The clustering technique, based on an expert system whose rules specify similarity requirements, uses the alerts already stored in the database together with the current alert to evaluate. The idea of evaluating new alerts against already constructed alert groups (meta alerts) is the same as that of Valdes and Skinner [Valdes & Skinner, 2001], but the underlying technique is not based on a probabilistic approach for similarity aggregation. In fact, Cuppens defined a similarity relation with which two alerts can be considered sufficiently close to be clustered. This corresponds to predicates that describe similarities between IDMEF DTD entities and attributes. The approach is generic: if an entity is added to the DTD, it is sufficient to use the entity similarity predicate, and in the same manner, if a new attribute is introduced, the attribute similarity predicate is used to compare the values of the attributes of the two considered alert instances. He defined four expert similarity rules. These rules are domain specific: since the approach is a cooperative one, the analysis of the alerts generated by the heterogeneous IDSs is necessary, and the rules specify in which cases the classification, time, source and target entities are similar. Conflict resolution and alert merging is the other step, used to represent each cluster identified in the first step by one global alert. The experiments were performed using the alerts provided by two different IDSs, Snort [Snort NIDS, 2005] and E-Trust [Computer-Associates, 2000]. Of the 87 attacks considered, 18 were not detected. The two IDSs generated 325 alerts, and the clustering technique presented above reduced this number to 101, whereas the aggregation was expected to produce 69 global alerts. The author explained that this difference comes from the fact that some attacks were considered elementary when they are not; the IDSs therefore generated more alerts than attacks, corresponding to the elementary steps of the remaining (complex) attacks. In addition, 6 clusters were false positives due to the temporal delay (after which two alerts are no longer considered similar). The author suggested, however, that the remaining attacks may be detected using alert correlation, which is discussed in the following paragraphs.
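As an illustration of similarity-driven clustering, here is a minimal sketch (ours; the original system expresses such rules as expert predicates in Prolog over IDMEF entities, and the rule bodies below are simplified assumptions):

```python
# Sketch of expert-rule alert clustering: an incoming alert joins the first
# cluster whose representative it is 'similar' to (simplified rules).
from dataclasses import dataclass

@dataclass
class Alert:
    classification: str
    source: str
    target: str
    time: float  # seconds

def similar(a, b, max_gap=60.0):
    """Illustrative similarity rules on classification/source/target/time."""
    return (a.classification == b.classification
            and a.source == b.source
            and a.target == b.target
            and abs(a.time - b.time) <= max_gap)

def cluster(alerts):
    clusters = []
    for alert in alerts:
        for c in clusters:
            if similar(alert, c[0]):
                c.append(alert)
                break
        else:
            clusters.append([alert])
    return clusters

alerts = [Alert("scan", "10.0.0.1", "10.0.0.9", 0.0),
          Alert("scan", "10.0.0.1", "10.0.0.9", 12.0),
          Alert("scan", "10.0.0.1", "10.0.0.9", 300.0)]
print([len(c) for c in cluster(alerts)])  # [2, 1]: last alert arrives too late
```

The temporal delay parameter (max_gap here) is exactly what produced the 6 false positive clusters mentioned above: two alerts belonging to the same attack but separated by more than the delay end up in different clusters.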

2.4.2 Alert correlation

In this paragraph, we present some work on correlating alerts generated by different IDSs. By alert correlation, we mean the techniques that may be used to discover ongoing attack scenarios based on the alerts provided by the deployed IDSs. We briefly expose the three approaches to alert correlation: explicit, semi-explicit and implicit alert correlation.

Explicit alert correlation

This approach may be used when the operator can express the relations between the events that he knows. The different attack scenarios may thus be described a priori and then compared against the generated alerts. This approach is very similar to the scenario based techniques (see Section 2.3.2) developed during the last decade; however, alerts are used instead of the UNIX commands used in the scenario based techniques. Many approaches of this category have been proposed in the literature. In the following, we briefly examine the AC algorithm [Debar & Wespi, 2001], the chronicles [Morin & Debar, 2003] and two attack languages that specify attack scenarios.

In Section 2.4.1, we presented the aggregation part of the AC algorithm [Debar & Wespi, 2001]. The second part of this algorithm is alert correlation, where explicit rules are programmed into the aggregation and correlation components to create relationships between alerts. The consequence alerts correspond to linking the aggregated alerts in a given order, where the link should happen within a given time interval. As for duplicates, a configuration file is used to define the consequences. A consequence definition consists of six associated terms:

• Initial alert class: the class of the original alert.

• Initial probe token: indicates the probe (the term probe corresponds here to any available commercial or open source IDS) from which the alert comes. This can be a variable (in which case it should be the same variable as in the fourth term, to indicate that the only constraint is that the two probes are one and the same) or a wildcard.

• Consequence alert class: the class of the alert considered as a consequence of the first one.

• Consequence probe token: indicates the probe from which the consequence alert comes. This can be a variable (in which case it should be the same variable as in the second term, to indicate that the only constraint is that the two probes are one and the same) or a wildcard.

• Severity level: a severity level for the signature simulating the missed alert.

• Wait period: the time to wait for the consequence alert.

When an event occurs, the consequence definitions are examined to verify whether the current alert is a consequence of some previously received alert; if this is the case, the link is marked and the consequence definition is stored. This system does not allow an attack scenario composed of more than two steps to be specified directly; however, a set of consequence definitions can encode the successive steps that compose a given scenario.

In [Morin & Debar, 2003], Morin and Debar used chronicles to explicitly express attack scenarios. Chronicles [Dousson, 1994] are used in dynamic systems and evolve according to the monitored system state. There are two main constraints in chronicles: time and event occurrence. Chronicles take as input the events provided by the monitored system and generate the actions to be performed when a chronicle is recognized. Each chronicle may be seen as a set of events on which a set of temporal and contextual constraints applies; if the observed events correspond to those defined in the chronicle and the constraints are satisfied, then a corresponding chronicle instance is recognized. In the alert correlation setting of [Morin & Debar, 2003], the input events of the chronicles are the intrusion detection alerts. To specify chronicles, the scenarios must be known a priori: the operator has to explicitly maintain a database of the known scenarios.

In [Michel & Mé, 2001], Mé et al. proposed a high level language called Adele for specifying attack scenarios. Each step of a scenario is recognized when the corresponding alert is generated. The interest of specifying attack scenarios rather than alert scenarios resides in the abstraction level of the attack scenarios: an operator may imagine attack scenarios and specify them with this language, to be matched when alerts are received from the probes. Shortly before, Cuppens et al. [Cuppens & Ortalo, 2000] introduced a declarative language called LAMBDA (Language to Model a Database for Detection of Attacks), which is detailed in the following paragraph. The two languages LAMBDA and Adele were both introduced during the Mirador project and have many common features. Both specify attack scenarios; however, while the latter uses a procedural approach, the former uses a declarative one. LAMBDA is also used to model attacks for the semi-explicit correlation approach proposed by Cuppens [Cuppens & Miège, 2002], presented in the following paragraph. Another difference is Adele's ability to specify reactions against detected attacks: while Adele's reactions are specified a priori for their corresponding attacks, if any, LAMBDA is more flexible since it selects automatic and on line counter measures.
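Returning to the AC algorithm's consequence definitions above, here is a minimal sketch (ours; the field names and the probe-variable convention are illustrative assumptions, not the AC algorithm's actual configuration syntax):

```python
# Sketch of consequence matching: does a second alert follow a first one
# according to a six-term consequence definition, within the wait period?
from dataclasses import dataclass

@dataclass
class ConsequenceDef:
    initial_class: str
    initial_probe: str      # concrete probe id, "X" as a variable, or "*"
    consequence_class: str
    consequence_probe: str  # "X" too means: same probe as the initial alert
    severity: int           # severity of the signature simulating a missed alert
    wait_period: float      # seconds to wait for the consequence alert

def probe_ok(defn, first_probe, second_probe):
    if defn.initial_probe == "X" and defn.consequence_probe == "X":
        return first_probe == second_probe   # same-variable constraint
    return all(tok in ("*", probe)           # wildcard or exact probe id
               for tok, probe in ((defn.initial_probe, first_probe),
                                  (defn.consequence_probe, second_probe)))

def is_consequence(defn, first, second):
    """first and second are (alert_class, probe, timestamp) tuples."""
    return (defn.initial_class == first[0]
            and defn.consequence_class == second[0]
            and probe_ok(defn, first[1], second[1])
            and 0 <= second[2] - first[2] <= defn.wait_period)

d = ConsequenceDef("scan", "X", "exploit", "X",
                   severity=3, wait_period=120.0)
print(is_consequence(d, ("scan", "ids1", 0.0), ("exploit", "ids1", 45.0)))  # True
print(is_consequence(d, ("scan", "ids1", 0.0), ("exploit", "ids2", 45.0)))  # False
```

Chaining such definitions (the consequence class of one definition serving as the initial class of the next) is how a scenario of more than two steps is encoded.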
Semi-explicit alert correlation

Maintaining a database of scenarios when using explicit alert correlation is a huge task, since numerous scenarios must be taken into account and attackers may change their behavior in order to mislead the scenario recognition system. In addition, many attackers first gather information about the target system, then launch attacks using the knowledge acquired in this first step, and so on. To cope with these situations, a correlation method that does not require maintaining the different scenarios during the recognition step was introduced.

In [Cuppens & Miège, 2002], Cuppens proposed a semi-explicit technique for correlating alerts using the LAMBDA language. A LAMBDA description of an action is composed of the following attributes:

• pre-condition: defines the state of the system needed to achieve the action, expressed as a conjunction of the involved predicates.

• post-condition: defines the state of the system after the execution of the action, also expressed as a conjunction of the involved predicates.

• scenario: the combination of the events involved in the scenario describing the attack.

• detection: describes the alert expected upon the detection of the action.

• verification: specifies the condition that verifies the success of the action.

Two actions A and B are correlated when the realization of A has a positive influence over the realization of B (given that A occurred before B). More formally, let post(A) be the set of post-conditions of action A and pre(B) the set of pre-conditions of action B. Then A and B are directly correlated if there exist Ea and Eb such that:

• (Ea ∈ post(A) ∧ Eb ∈ pre(B)) or (not(Ea) ∈ post(A) ∧ not(Eb) ∈ pre(B)), and

• Ea and Eb are unifiable through a most general unifier θ.

To illustrate this semi-explicit technique, let us consider the following elementary attacks that represent part of an attack scenario:

1. Attack rpcinfo(Attacker, Address)
   Pre: remote_access(Attacker, Address) ∧ use_service(Address, mountd)
   Post: knows(Attacker, use_service(Address, mountd))
   Detection: classification(Alert, 'rpcinfo') ∧ source(Alert, Attacker) ∧ target(Alert, Address)

2. Attack showmount(Attacker, Address)
   Pre: remote_access(Attacker, Address) ∧ use_service(Address, mountd) ∧ mounted_partition(Address, Partition)
   Post: knows(Attacker, mounted_partition(Address, Partition))
   Detection: classification(Alert, 'showmount') ∧ source(Alert, Attacker) ∧ target(Alert, Address)

Sometimes the effect of an attack corresponds only to a knowledge gain for the attacker. To express this knowledge, a meta-predicate denoted knows is used. In the example above, knows(Attacker, use_service(Address, mountd)) means that Attacker knows that the NFS mountd service is available on the host whose address is Address.
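A minimal sketch of this correlation test (ours, not CRIM's implementation; predicates are nested tuples and variables are capitalized strings):

```python
# Sketch of LAMBDA-style direct correlation via unification.
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def unify(a, b, subst):
    """Return an extended substitution unifying a and b, or None on failure."""
    a = subst.get(a, a) if is_var(a) else a
    b = subst.get(b, b) if is_var(b) else b
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None

def body(e):
    """knows(Agent, P) conveys the fact P, so correlation looks inside it."""
    return e[2] if isinstance(e, tuple) and e[:1] == ("knows",) else e

def correlated(post_a, pre_b):
    """A and B are directly correlated if some Ea in post(A) unifies with
    some Eb in pre(B); the not(Ea)/not(Eb) case unifies the same way."""
    return any(unify(body(ea), body(eb), {}) is not None
               for ea in post_a for eb in pre_b)

# rpcinfo's post-condition vs showmount's pre-conditions (variables renamed
# apart, as each attack instance carries its own variables)
post_rpcinfo = [("knows", "Attacker1", ("use_service", "Address1", "mountd"))]
pre_showmount = [("remote_access", "Attacker2", "Address2"),
                 ("use_service", "Address2", "mountd")]
print(correlated(post_rpcinfo, pre_showmount))  # True: the addresses unify
```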


The effect of the attack rpcinfo(Attacker, Address) is that the attacker knows that the target host runs the NFS daemon (knows(Attacker, use_service(Address, mountd))), and since the condition use_service(Address, mountd) appears in the pre-conditions of the second attack, one can infer that these two attacks may be correlated. This correlation is called direct correlation, since it is established directly between a post-condition expression of one attack and a pre-condition of another attack, as in the above example. In addition, Cuppens proposed to introduce ontological rules to represent possible relations between predicates. As in attack modeling with LAMBDA, these rules are also presented using pre- and post-conditions. To illustrate this idea, let us consider the WinNuke attack, which corresponds to a denial of service against Windows machines. One of the pre-condition predicates of this attack is use_os(Target_Host, 'windows'). The attacker A also has a tool that permits him to scan a remote host for available services. If we consider TCPScan as an attack, then one post-condition of this attack, after using the scanning tool, would be knows(A, use_service(Target_Address, 'netbios')). These two attacks could not be correlated, since knows(A, use_service(Target_Address, 'netbios')) cannot be unified with use_os(H, 'windows'). However, the netbios service is specific to Windows operating systems. This expert knowledge is therefore specified by an ontological rule whose pre-condition is use_service(H, 'netbios') and whose post-condition is use_os(H, 'windows'). The notion of correlation is then generalized by considering a chain of correlated ontological rules, which may be correlated to two attack models, the first through its post-condition and the second through its pre-condition. In the WinNuke example, tcpscan (on TCP port 139, the Windows netbios port) may be indirectly correlated to WinNuke through the ontological rule, since knows(A, use_service(Target_Address, 'netbios')) is unifiable with use_service(H, 'netbios') and use_os(H, 'windows') is unifiable with use_os(Target_Host, 'windows'). An operator then only has to specify the pre-conditions and post-conditions of the elementary attacks he can model, since the scenarios are generated automatically and constructed on line according to the attacks detected by the IDSs deployed over the network. With this specification, the operator does not have to write all scenarios as in explicit alert correlation, where this task may be exhaustive since in real life there are numerous known scenarios. Ning et al. [Ning et al., 2002] proposed an approach similar to that of Cuppens, so we do not discuss it here.

Attacker intention recognition

Another extension to this model, suggested by Cuppens [Cuppens et al., 2002], is the recognition of the attacker's intention. He proposed to extend the notion of correlation to correlate attacks with the attacker's intrusion objectives. An intrusion objective is modeled by a system state condition that corresponds to a violation of the security policy; as an example, the security policy is violated when an sql server goes down. Figure 2.7 shows such an intrusion objective on an sql server.
objective sqlserver_failure(Host)
  state: deny_of_service(Host)   -- denial of service on Host
         ∧ server(Host, sql)     -- Host is an sql server

Figure 2.7: Intrusion objective: denial of service on an sql server.

Counter measures and anti-correlation

While the Adele language proposes a static formalism for reacting against intrusions [Michel & Mé, 2001], LAMBDA is used with the same idea of semi-explicit correlation to react against intrusions. In [Cuppens et al., 2004, Cuppens et al., 2006], Cuppens et al. introduced the notion of anti-correlation for launching appropriate counter measures against ongoing attack scenarios. A counter measure has attributes similar to those of an attack; while a detection attribute is associated with an attack, an action is associated with the counter measure.

Definition of anti-correlation with LAMBDA: two actions A and B are anti-correlated when the realization of A has a negative influence over the realization of B (given that A occurred before B). More formally, if post(A) is the set of post-conditions of action A and pre(B) is the set of pre-conditions of action B, we say that A and B are directly anti-correlated if there exist Ea and Eb such that:

• (not(Ea) ∈ post(A) ∧ Eb ∈ pre(B)) or (Ea ∈ post(A) ∧ not(Eb) ∈ pre(B)), and

• Ea and Eb are unifiable through a most general unifier θ.

To illustrate the different steps of CRIM, particularly correlation, attacker intention recognition and anti-correlation, let us take the example of a DDoS attack that we modeled with LAMBDA in [Bouzida et al., 2006]. Distributed denial of service (DDoS) attacks are becoming a big threat to the Internet; some recent DDoS attacks have infected more than 100,000 vulnerable hosts over the Internet within 10 minutes. In the following, we show how the cooperative approach based on CRIM can detect coordinated attack scenarios through alert correlation of distributed IDSs. The main architecture of classical flooding-based DDoS attacks is presented in Figure 2.8, and the LAMBDA model corresponding to the global scenario is shown in Figure 2.9. We do not take into account the early steps of the DDoS tools, which consist in scanning and compromising the hosts that will later play the role of masters and slaves. These intrusions may be detected by some IDSs, and the corresponding scenarios may then be constructed and detected with the correlation method discussed above. However, if the attacker has legitimate physical or remote access to the different machines, these first steps cannot be detected. The scenario shown in Figure 2.9 corresponds to the steps followed by an attacker, after compromising the useful hosts, to launch a distributed denial of service against a victim. In reality, it corresponds to the scenario of activating well known DDoS tools such as Trinoo, Stacheldraht, TFN, Mstream or Shaft. Once activated on the compromised hosts, these DDoS tools perform classical DoS attacks such as syn flooding or smurfing. The LAMBDA models of the elementary attacks composing the scenario are shown in Figure 2.10. The first step consists in opening a connection between the attacker and the master. The pre-condition of this first attack states that host H is a master compromised by attacker A; its post-condition specifies that the attacker has opened a connection with the master.

Figure 2.8: DDoS Attack Architecture [CERT, 1999]. (The attacker sends control traffic to the masters, which control the slaves; the slaves send the attack traffic to the victim.)

When the attacker launches (automatically or not) the daemon(s) on the compromised slave hosts, these slaves send a message called a "show alive message" to the masters that control them, in order to signal their readiness to flood a victim. This action may be performed in parallel with the first step. In the third step, the attacker sends a dos command to the master to start a DDoS attack using the detected slave computers. Thereafter, the master sends a dos command to the slaves in order to flood the desired victims. Once the CRIM engine has recognized the global scenario using the approach explained in the previous sections, an appropriate counter measure should be launched: killing the slave daemon process. In addition, the administrator of the host where the daemon is located is warned; he should find the vulnerability that permitted the installation of the daemon on that host, then disinfect and patch the system. Without doing this, other daemons may be launched automatically from the same host; this is what is called a counter counter-measure. We stress that the main goal of using the correlation technique to recognize the first steps of an ongoing DDoS scenario is to react before the objective is reached, i.e. before the flooding starts, since at that point it is difficult and too late to react. Although the semi-explicit technique simplifies the operator's task of specifying and writing attack scenarios in a given language, one issue remains open: it is difficult to take into account alerts generated by anomaly detection systems, since the operator cannot model an elementary attack that is not yet known at the time he specifies the elementary attacks.

attack connection(A,H)
  pre:  remote_access(A,H) ∧ master(H)
  post: connected(A,H)

attack show_alive(S,H)
  pre:  remote_access(H,S) ∧ master(H) ∧ slave(S)
  post: knows(H,slave(S))

attack command_dos(A,H,V)
  pre:  connected(A,H) ∧ master(H) ∧ knows(H,slave(S))
  post: dos_command_sent(H,S,V)

attack command_dos_to_slave(H,S,V)
  pre:  knows(H,slave(S)) ∧ master(H) ∧ dos_command_sent(H,S,V)
  post: distributed_denial_of_service(V)

counter-measure kill_slave(S)
  pre:  slave(S)
  post: not(slave(S))

objective ddos(V)
  state: distributed_attack(V)

(Edges in the graph link these nodes by correlation, and link kill_slave to the scenario by anti-correlation.)

Figure 2.9: DDoS Attack correlation graph.

Implicit alert correlation

The functionality of implicit alert correlation differs from those of explicit and semi-explicit alert correlation. The first two techniques try to recognize the attacker's intention using scenarios that are constructed a priori or during the reception of the corresponding alerts, whereas implicit correlation tries to group alerts using an implicit link between them. This implicit link is either retrieved by exploiting the alerts generated by the different IDS tools, or specified as a similarity criterion, as in the approach of Valdes and Skinner [Valdes & Skinner, 2001] presented in Section 2.4.1. The idea behind this technique is thus not to construct scenarios or recognize the attacker's intention. However, Valdes and Skinner [Valdes & Skinner, 2001] proposed to relax the similarity expectation on the attack class in order to construct the various steps of a multistage attack (i.e. a scenario). Relaxing the similarity expectation on the attack class may lead to many false positive scenarios: alerts coming from heterogeneous IDSs reporting on the same host might refer to the same events, and two alerts corresponding to the same event might then wrongly be considered as two consecutive steps of an attack scenario.

In [Qin & Lee, 2003], Qin and Lee proposed another implicit correlation method based on a time series causality test, namely the Granger Causality Test (GCT). The intuition behind the GCT is that if an event X is the cause of another event Y, then event X should precede event Y. Before applying alert correlation, two main steps are performed. The first is alert aggregation, which groups the raw alerts generated by the heterogeneous sensors into hyper alerts. The second is alert prioritization, which ranks each hyper alert according to its relevance to the mission goal, so that, using the priority rank, the security analyst can select important alerts as target alerts for further correlation and analysis.

attack connection(A,H)
  pre:          remote_access(A,H) ∧ master(H)
                -- attacker A has remote access on H; H is a DDoS master host
  detection:    classification(Alert,'master') ∧ source(Alert,A) ∧ target(Alert,H)
  post:         connected(A,H)
                -- the attacker has opened a connection to the master H
  verification: true

attack show_alive(S,H)
  pre:          remote_access(S,H) ∧ master(H) ∧ slave(S)
                -- the slave S has remote access on the master H
  detection:    classification(Alert,'show_alive(S,H)') ∧ source(Alert,S) ∧ target(Alert,H)
  post:         knows(H,slave(S))
                -- master H knows that host S is an alive slave
  verification: true

counter-measure kill_slave(S)
  pre:          slave(S)
  action:       kill_slave(S)
                -- kill the daemon of the running slave process
  post:         not(slave(S))
  verification: not(slave(S))
                -- verify that there is no slave running on host S

attack command_dos(A,H,V)
  pre:          connected(A,H) ∧ master(H) ∧ knows(H,slave(S))
  detection:    classification(Alert,'command_dos') ∧ source(Alert,A)
                ∧ additional_data(Alert,'ddos_victim',V) ∧ target(Alert,H)
  post:         dos_command_sent(H,S,V)
                -- H sends a dos command for S to flood V
  verification: true

attack command_dos_to_slave(H,S,V)
  pre:          knows(H,slave(S)) ∧ master(H) ∧ dos_command_sent(H,S,V)
  detection:    classification(Alert,'command_dos_to_slave') ∧ source(Alert,H)
                ∧ additional_data(Alert,'ddos_victim',V) ∧ target(Alert,S)
  post:         distributed_denial_of_service(V)
  verification: unreachable(V)
                -- host V does not reply

objective ddos(V)
  state: distributed_denial_of_service(V)

Figure 2.10: Modeling a DDoS scenario with LAMBDA.


The priority score of an alert is computed based on the relevance of the alert to the configuration of the protected networks and hosts, as well as on the severity of the corresponding attack as assessed by the security analyst. Using the priority value and the mission goals, the security analyst can specify a hyper alert as a target with which other alerts are correlated. The GCT algorithm is then applied to the corresponding alert time series. Note that an operator has to specify the target alert in order to find the causality link between alerts. In addition, all attributes of the alerts, in particular the class name, have to be used to construct hyper alerts; therefore, an anomaly IDS that generates alerts without classification cannot be taken into account by this technique.
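For illustration, the following sketch (ours; the alert counts are synthetic and the use of statsmodels is our choice, not Qin and Lee's implementation) applies a Granger causality test to two alert-count time series, asking whether the first series helps predict the second:

```python
# Granger causality between two alert-count time series (synthetic data).
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(0)
scan = rng.poisson(3, size=200).astype(float)         # counts of 'scan' alerts
exploit = np.roll(scan, 2) + rng.normal(0, 0.5, 200)  # lags 'scan' by 2 steps

# Column order matters: the test asks whether column 2 Granger-causes column 1.
data = np.column_stack([exploit, scan])
results = grangercausalitytests(data, maxlag=3)
p_value = results[2][0]["ssr_ftest"][1]  # F-test p-value at lag 2
print(f"lag-2 p-value: {p_value:.4f}")   # small: 'scan' helps predict 'exploit'
```

A small p-value at some lag supports a causal, and hence correlation, link between the two hyper alerts.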

2.4.3 Cooperative architectures in intrusion detection

This last section briefly presents some architectures developed for making different IDSs cooperate. In [Porras & Neumann, 1997], Porras and Neumann proposed a cooperative approach called EMERALD (Event Monitoring Enabling Responses to Anomalous Live Disturbances). EMERALD uses a hierarchically layered approach composed of three layers and is designed for tracking malicious activity through and across large networks. Each layer contains a set of agents, called monitors in this architecture. Each monitor has its own sensors, which may use statistical methods or rule based methods. The three layers are: the service layer (at the lowest level), which covers the misuse of individual components and network services within the boundary; the domain layer, which covers misuse across multiple services and components; and the enterprise layer, which covers coordinated misuse across multiple domains. The service monitors are deployed only within a domain. Domain monitors correlate intrusion reports disseminated by individual service monitors, providing a domain-wide perspective of malicious activity. The enterprise monitors, accordingly, correlate activity reports produced across the set of monitored domains in order to detect attacks over the whole monitored network. The EMERALD distributed analysis provides a global abstraction of the cooperative and distributed community of domains. The agents may exchange information through a subscription-based communication scheme. The events analyzed by this system come from varied sources: audit events, SNMP traffic, application and software logs, and any analysis results generated by the heterogeneous IDSs deployed over the different domains. In the EMERALD architecture, two components perform statistical and rule based detection. The statistical component, called EMERALD's profiler engine, performs statistical profile based anomaly detection similar to NIDES [Javitz & Valdes, 1993] (the successor of IDES [Javitz & Valdes, 1991]), which is discussed in Chapter 3. EMERALD's signature engine, on the other hand, employs a rule-coding scheme that is a variant of the expert system P-BEST [Lindqvist & Porras, 1999].

In [Cuppens, 2001a], Cuppens presented a cooperation module for intrusion detection called CRIM (for "Coopération et Reconnaissance d'Intentions Malveillantes") that includes the aggregation, correlation, attacker intention recognition and reaction steps described in the above sections. Figure 2.11 shows the different elements and the main principles of this architecture. One possible distributed architecture developed for CRIM is described in [Garcia et al., 2004]; however, CRIM may be implemented in either a distributed or a centralized architecture.

Alerts coming from the IDSs are handled by an alert base management function, backed by an attack description database: alert clustering produces alert clusters, alert merging turns them into global alerts, alert correlation derives candidate plans, and intention recognition yields a global diagnosis that drives the reaction.

Figure 2.11: General architecture of CRIM.

There are many other architectures developed for cooperating alerts. Some of them have a centralized architecture, such as DIDS [Snapp et al., 1991] and NADIR [Hochberg et al., 1993]; others use hierarchical approaches (like EMERALD), such as GrIDS [Staniford-Chen et al., 1996] and NetSTAT [Vigna & Kemmerer, 1999]. The remaining systems use alternative approaches, such as AAFID [Spafford & Zamboni, 2000], Micael [Queiroz et al., 1999] and IDA [Asaka et al., 1999], which propose to use mobile agents to collect the different pieces of an attack.

2.5 Limitations of the current IDSs

Perfect security is an impractical goal. Effective security, on the other hand, requires that security measures be deployed selectively and in a manner that balances security against other competing concerns. The main objective to keep in mind when proposing an IDS is to build strong and flexible mechanisms that enable users to make such tradeoffs. An IDS can be measured by its effectiveness, adaptability and extensibility. An IDS is effective if it generates a high true positive rate and a low false alarm rate. It is adaptable if it (1) can generalize its knowledge to detect new intrusions that are variations of known ones, (2) can easily be updated after new intrusions take place, or (3) can detect new attacks that are not yet known. It is extensible if it can be customized to new system or network environments.

Current IDSs lack effectiveness. The old scenario based techniques use rules and patterns, and the statistical techniques use statistical measures. These patterns and measures correspond to codified expert knowledge in security, operating system design and the corresponding intrusion detection techniques. This knowledge is mostly based on known vulnerabilities and may sometimes be incomplete or imprecise due to the complexity of the monitored system or network.

They also lack adaptability. The alerts generated by current IDSs correspond to known attacks, i.e. those specified by experts after analyzing current vulnerabilities and intrusion techniques; therefore, only known attacks are detected. As an example, signature based IDSs cannot detect even a slight modification of an attack, even one affecting a single letter of the considered pattern. Introducing new techniques and ideas to detect novel (unknown and future) attacks is a challenging problem.

Current IDSs also lack extensibility. Customizing an intrusion detection technique to another environment is hardly possible, since the different measures and considerations are specific to the environment and configuration for which the technique was designed. For example, immune based techniques cannot be used to detect attacks in the network, since they are designed for the operating system on which they were developed.

Recent research in intrusion detection has brought new advances to cope with some of these weaknesses. Some of them consist in aggregating and correlating alerts, which is extremely useful since most IDSs generate high rates of false positive alarms; using alert correlation, new scenarios may also be detected. However, new attacks are not detected by the individual sensors, so alert correlation will still fail to detect novel attacks.

2.6 Summary

In this chapter, we presented the different intrusion detection techniques investigated during the last three decades. Our presentation of anomaly intrusion detection, first introduced by Anderson and then enhanced by Denning, was limited to the description of the statistical approach proposed by Denning; two other anomaly based techniques, namely immune systems and policy based detection, were discussed in depth and criticized. Scenario based techniques, investigated during the 90's to detect known vulnerabilities in the UNIX operating system, were described, and we slightly improved one of these models. Most current and widely used network intrusion detection tools are signature based; these tools use pattern matching techniques and fail to detect new attacks. In the last part of this chapter, we thoroughly presented the cooperative approaches investigated by many researchers during the last few years. These methods use alert aggregation and correlation. However, the elementary alerts generated by the different tools only cover known attacks; therefore, a new attack scenario composed of new elementary attacks would never be detected by these techniques. This is the reason why, in Chapter 5, we present an alternative technique to detect new anomalies, particularly in network traffic. Other anomaly techniques are presented in the following chapters, since we compare our proposed methods to them. In fact, the next chapter discusses the first anomaly detection approach we introduced to model user behavior, using an information theory technique called principal component analysis.

Chapter 3

EigenProfiles to Intrusion detection

In this chapter, we introduce an eigenprofile approach [Bouzida & Gombault, 2003b], based on Principal Component Analysis (PCA), for anomaly intrusion detection. We first investigate some related work, namely Hyperview's neural network component for intrusion detection and the IDES statistical anomaly detector, which observe behavior on a monitored computer system, adaptively learn what is normal for individual users, then monitor future observed behavior and identify it as a potential intrusion if it deviates significantly from the expected behavior. We then point out some difficulties and disadvantages in deploying these techniques in the real world. We give a thorough description of multivariate principal component analysis and describe our algorithm based on PCA. This approach works by projecting users' profiles onto a feature space that spans the significant variations among known user profiles. The significant features are called eigenprofiles because they are the eigenvectors (principal components) of the set of user profiles. The projection operation characterizes a user profile by a weighted sum of the eigenprofile features; hence, to detect whether a user profile is anomalous, it is sufficient to compare its weights to those of known user profiles.

3.1 Related work

We discuss two different intrusion detection systems. IDES was first introduced by Denning et al. in [Denning & Neumann, 1985], where its foundations were set; it was then implemented as a statistical model for anomaly detection, combined with an expert system for misuse detection that detects known attacks whose corresponding vulnerabilities are known a priori. Hyperview [Debar et al., 1992], on the other hand, shares some features with IDES in the sense that it also has two components: the first is a neural network for user behavior anomaly detection and the second is an expert system, also used for detecting known attacks. These tools differ in the method used for detecting behavior deviations, but they share a common drawback: each user's behavior is considered independently of the others.

3.1.1 The SRI IDES statistical anomaly detection tool

The SRI IDES [Javitz & Valdes, 1991] is a real time anomaly intrusion detection system that implements the anomaly model introduced by Denning (discussed in Section 2.2.1). It has two components. The first component is an expert system that flags the observed behavior as anomalous if one of its rules is triggered. The second component is a statistical model that adaptively learns normal behavior and distinguishes between normal and abnormal behavior using a multivariate statistical engine discussed in the following paragraphs.

The IDES statistical anomaly detector maintains a set of normal behaviors that are represented by profiles. A profile is a description of normal behavior according to a set of intrusion detection measures. Profiles are designed so that only a minimum amount of historical data needs to be decoded and interpreted during the detection phase. Since many measures can be useful for detecting intrusions, the profiles keep only significant statistics such as frequency tables, means and covariances. The process used by IDES to determine the nature of the observed behavior is based on statistics controlled by dynamically adjustable parameters, many of which are specific to each considered subject. A vector of intrusion detection variables is assigned to each audited activity, corresponding to the measures recorded in the profiles. Some measures may be inhibited and others not, depending on their usefulness in the target system.

The relevant profiles stored in the knowledge base are checked each time an audit record arrives; these profiles are compared with the recently observed vector of intrusion detection variables. A new record is considered anomalous if the vector of intrusion detection variables is sufficiently far from the point defined by the expected values stored in the profiles, with respect to the historical covariances of those variables. An audit record is therefore considered anomalous not by comparing audit variables independently, but also by using the correlation between variables. Hence, IDES evaluates the total usage pattern, not just how a subject behaves with respect to each measure considered separately. An aging method, generally over thirty days, is used to update the knowledge database. This aging creates a moving time window for the profile data, so that the expected behavior is influenced most strongly by the most recently observed behavior. This allows IDES to adaptively learn subjects' behavior patterns as they change their habits.

The following paragraphs summarize the IDES statistical anomaly detector. An IDES score value (denoted IS) is generated each time an audit record occurs; it reflects the degree to which recent behavior is similar to the historical profile. There are two types of individual measures: ordinal measures (such as CPU time or I/O counts) and categorical measures (such as the names of files accessed or the names of machines used to log in). IDES [Javitz & Valdes, 1991] considers two main statistics (Q and S) for each individual measure, where each S statistic is a transformation of a basic statistic Q. For example, if S represents the degree of abnormality of recent CPU usage, then the corresponding Q is a measure of how much the CPU was used in the recent past. In fact, each S measure indicates whether the Q value associated with the current audit record and its near past is unlikely or not. By observing the values of Q over many audit records and by selecting appropriate intervals for categorizing Q values, we can build a frequency distribution for Q. If we consider the number of files a user accesses each day, we might find the following:

• 2% of the Q values are in the interval of 0-1 accesses
• 28% are in the interval 1-2


• 44% are in the interval 2-4
• 23% are in the interval 4-8
• 3% are in the interval 8-max

The S statistic should be a large positive value when the value of Q is in the interval 0-1 file accesses (because this is an unusual value of Q) and close to zero when the Q value is in the interval 2-4 (because this is the most frequently encountered interval). IDES uses 16 intervals for each Q measure (we consider 5 intervals here for simplicity), with interval spacing determined dynamically for each user. The last interval is not upper bounded, so that every value of Q belongs to some interval. Summing the probabilities in ascending order over the above example, we find:

• P1 = 2%
• P2 = 2% + 3% = 5%
• P3 = 5% + 23% = 28%
• P4 = 28% + 28% = 56%
• P5 = 56% + 44% = 100%

For each probability value $P_i$, the system determines the value $S_i$ such that a normally distributed variable with mean 0 and variance 1 is larger than $S_i$ in absolute value with probability $P_i$. The corresponding value satisfies $Prob(|N(0,1)| \geq S_i) = P_i$, i.e. $S_i = \Phi^{-1}(1 - \frac{P_i}{2})$, where $\Phi$ is the cumulative distribution function of the normal distribution N(0,1). For the above example, we obtain, using the normal distribution table:

• S1 = 2.34
• S2 = 1.96
• S3 = 1.08
• S4 = 0.59
• S5 = 0

After processing a new audit record, the corresponding Qnew value falls in one of the 5 intervals. Therefore, for the above example:

• if Qnew is in interval 0-1, then the corresponding measure is Si = 2.34
• if Qnew is in interval 1-2, then Si = 0.59
• if Qnew is in interval 2-4, then Si = 0
• if Qnew is in interval 4-8, then Si = 1.08
• if Qnew is in interval 8-max, then Si = 1.96

Since the IS statistic is a summary judgment of the abnormality of many measures, it is calculated as

$$IS = (S_1\, S_2\, \ldots\, S_n)\, C^{-1}\, (S_1\, S_2\, \ldots\, S_n)^T$$

where $C^{-1}$ is the inverse of the correlation matrix of the vector $(S_1\, S_2\, \ldots\, S_n)$, $C_{ik}$ denotes the correlation between the two individual measures $S_i$ and $S_k$, $C_{ii} = 1.0$, and n represents the number of measures considered to monitor the information system. If all n measures were independent, the correlation matrix would be the identity matrix and IS would be the sum of the squares of the measures, i.e. $S_1^2 + S_2^2 + \ldots + S_n^2$. An audit record is declared abnormal when IS exceeds an appropriate threshold. The IS value does not tell the security officer which measures contribute most to the decision that a behavior is abnormal; it only provides the summary judgment that the considered behavior is abnormal.

Although the IDES statistical approach pays attention to the correlation between variables of the same subject, it does not take into account the correlation between the audit records of different users. This weakness may prevent the IDES statistical approach from finding existing relations between the behaviors of different users; it is therefore difficult to detect masqueraders [Anderson, 1980] using the anomaly component of the IDES tool. This is why, in Section 3.3, we investigate an information theory technique based on principal component analysis (PCA) that may be used to adaptively learn behaviors and to detect not only deviations from normal behavior but also masqueraders, i.e. users who may seize other users' authentication information and penetrate the system.
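The S transformation and the IS score above can be reproduced with a few lines; the following sketch (ours; the three-measure correlation matrix is a made-up illustration, not an IDES parameter set) uses the running example:

```python
# S statistics of the running example, and an IS score (synthetic C matrix).
import numpy as np
from scipy.stats import norm

# cumulative probabilities P_i of the ascending-ordered intervals
P = np.array([0.02, 0.05, 0.28, 0.56, 1.00])
S = norm.ppf(1 - P / 2)  # S_i = Phi^{-1}(1 - P_i/2)
print(np.round(S, 2))    # ~ [2.33 1.96 1.08 0.58 0.], cf. table values above

# summary score over n = 3 measures: IS = S C^{-1} S^T
S_vec = np.array([2.33, 0.59, 1.08])
C = np.array([[1.0, 0.3, 0.1],   # illustrative correlation matrix, C_ii = 1
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
IS = S_vec @ np.linalg.inv(C) @ S_vec
print(round(float(IS), 2))       # flagged abnormal if above a threshold
```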

3.1.2 Hyperview: a neural network component for intrusion detection

Introduction

Hyperview [Debar et al., 1992] is an intrusion detection system that includes two principal components. The first component is an expert system inspired from IDES; this expert system has a knowledge base that contains a set of known intrusion scenarios. The second component is a neural network based anomaly component that adaptively learns the user's behavior and raises alarms when significant variations are noticed in the audit trail. The designers of the system note that the audit trail can emanate from a number of different sources, with different levels of detail: the keyboard level, where the system observes every keystroke made by the user; the command level, where the system records every command issued by a user; the session level, where the system aggregates the commands issued from login to logout; and finally, the group level, where several users are grouped together and treated as a class of known users. The authors then note that the more detailed the data made available to the intrusion detection system, the better the chance of the system correctly raising an alarm; however, the more data presented to the system, the more problematic storage and processing become. The most aggregated level of data puts the least strain on the intrusion detection system. For the purposes of Hyperview, the authors decided to provide the system with an audit trail at the command level.


Hypotheses about user behavior and the audit trail

The decision to employ a neural network for the statistical anomaly detection function of the system stems from a number of hypotheses about what the audit trail contains. The fundamental hypothesis is that the audit trail constitutes a multivariate time series, where the user is a dynamic process that emits a sequentially ordered series of events. The audit record representing such an event consists of variables of two types. For the first type, the values are chosen from a finite set of discrete values, for instance the name of the terminal from which the command was issued. For the second type, the values are continuous, for instance CPU usage. The more detailed hypotheses that follow from the fundamental one are:

1. The user submits commands to accomplish a given task. These commands will be consistent over time, as the user acquires preferences as to how the task should be performed. Between tasks, the actions of the user will be less predictable, or even unpredictable. Thus, we will observe patterns of usage in the audit trail as quasi-stationary sequences, interspersed with periods of non-stationary activity.

2. The preferred behavior of the user follows a stochastic law, and the audit trail is a projection of this law onto the variables of the audit record in question. The audit trail can thus be viewed as a set of samples of the quasi-stationary process. The authors note that it is difficult to express a law from a set of samples, even when the underlying process is quasi-stationary. This law will instead be treated as a black box and approximated by the neural network, without ever being made explicit.

3. There are correlations between the various measures contained in the audit record. This is a common sense hypothesis: increased CPU usage would, by necessity, always have an effect on, for instance, the cache hit ratio. Since the authors do not make the parameters of the user model explicit, they cannot express these correlations; the proposed neural network component must be able to take advantage of them during the learning process.

The neural network component

As a first step, the authors considered how to map the time series to the inputs of the neural network. At the time, the common approach was to map N inputs to a window of time series data, shifting the window by one between evaluations of the network. The authors acknowledged that this would make for a simple, easily trained model. However, this method has a number of problems:

1. N is completely static: if the value of N were to change, a complete retraining of the network would be required.

2. If N is not adequately chosen, the performance of the system is dramatically reduced. With too low a value of N, the prediction lacks accuracy because older relevant information is missing; with too high a value of N, the prediction is perturbed by irrelevant information.

3. During the quasi-stationary periods of usage, a large value of N would be preferred, to encompass the quasi-stationary process. During the transition periods, on the other hand, where the older data has no meaning, the value of N should be as small as possible, to eliminate this irrelevant data quickly.

The authors then state that correlations between input patterns are not taken into account by this model, since these types of networks learn to recognize fixed patterns in the input and nothing else. Other disadvantages are that such networks are slow to converge and that their adaptability is low, since partial retraining can lead to a network that forgets everything it has learned before. Instead, the designers of Hyperview chose to employ a recurrent network [Connor & Atlas, 1991], where part of the output of the network is connected back to its input, as input for the next stage. This creates an internal memory in the network. Between evaluations, the time series data is fed to the network one datum at a time, instead of through a shifting time window, the object of the latter being the same, namely to provide the network with a perception of the past. It is interesting to note that the recurrent network has long term memory about the parameters of the process, in the form of the weights of the connections in the network, and short term memory about the sequence under study, in the form of the activations of the neurons. At the time of the design, these kinds of networks were much less studied than the feed-forward ones.

Experimental Results

The designers put the neural network component of the system, the only part that was fully functional at the time of publication, to the test by feeding it an audit trail produced by an anonymous user on a SUN3 UNIX workstation. They used the accounting files as the source of the audit data, where each record contains the name of the command, the amount of CPU and core memory used, and the number of input/output operations performed. The beginning and end of each session were discernible from the audit trail. The first experiment considered the input as an endless continuous sequential stream of events. The artificial neural network was given each audit record sequentially and asked to predict the next command in the sequence; when the next record was presented, the network was retrained to reflect the new discovery. The commands, of which 60 different ones appear in the audit trail, were mapped onto one output neuron each, the optimal result being one neuron with a numeric value of 1.0 and the other neurons with a value of 0.0. Three important situations characterize the network's output:

1. Confidence: the maximum activation is numerically large, and there is a convincing difference with the second highest activation. If the prediction is correct, this is an ideal state of affairs, and a very troublesome one if the prediction is in fact false; the network is then overconfident in its ability to predict the correct behavior.

2. Uncertainty: the largest activation is very low. The outputs of the neurons are all in the same range: the network cannot discriminate, from what it knows, which command to propose next. This comes either from a lack of examples, i.e. this time series has not been seen before, or from an overabundance of choices: the time series could mean one of possibly many things.

3. Conflict: the largest activation is somewhere in the middle, and the difference with the second largest is too small. Either of the commands could be considered a likely candidate, and the output of the network is only an indication of which is the more likely.
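These three situations can be decided from the output activations alone; a minimal sketch (ours; the thresholds are made up, since Hyperview's actual decision parameters are not given):

```python
# Classify a prediction into the three situations described above
# (illustrative thresholds; not Hyperview's actual parameters).
import numpy as np

def judge(activations, high=0.7, low=0.2, margin=0.3):
    a = np.sort(np.asarray(activations))[::-1]  # descending
    if a[0] >= high and a[0] - a[1] >= margin:
        return "confidence"   # one clearly dominant candidate command
    if a[0] <= low:
        return "uncertainty"  # all outputs in the same low range
    return "conflict"         # medium top activation, too-small margin

print(judge([0.9, 0.1, 0.05]))   # confidence
print(judge([0.15, 0.12, 0.1]))  # uncertainty
print(judge([0.5, 0.45, 0.1]))   # conflict
```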


Debar et al. [Debar et al., 1992] observed a sequence of 6550 commands, trained the network on half of that sequence, and then fed the network the entire sequence. The results looked quite promising: correctly predicted commands had a high degree of confidence, and the farther the output of the artificial neural network was from the correct prediction, the lower the confidence. Looking closer at the results, it became evident that some types of commands were often predicted in error, for instance the date command, which displays the current time and date. The network had learned, however, to classify this as an irrelevant command, not worth considering for inclusion in the user profile; the authors reported that such commands could be characterized as noise in the deterministic sequence. Other commands, such as those issued when dealing with a prototype of a database system (which crashed often, and at random intervals), were marked as very indicative of the usage of that particular user. The network also managed to automatically associate commands with similar actions, such as sh and csh, often predicting sh for csh and vice versa. The authors left it to the expert system controlling the neural network to decide that "errors" like these were in fact not indicative of a security violation but of a more benign kind.

Using neural networks was an interesting idea for learning user profiles in anomaly intrusion detection. However, the manner in which they were used in Hyperview has a shortcoming: each user is assigned his own neural network, so it is not possible to compare two networks to tell whether the users they represent have close or totally different behaviors. This is one of the motivations for our investigation of an information theory technique, presented in Section 3.3, that overcomes this disadvantage.

3.2 Principal component analysis

3.2.1 Introduction

Principal Component Analysis (PCA) is one of the most known and used multivariate techniques. It draws its origin from the work of Pearson (1901) and Hotelling (1933) [Hotelling, 1933]. It was not widely used until the sixties when the computer rendered the mathematical computations less time consuming. The principal components are linear combinations of random or statistical variables which have special properties in terms of variances. For example, the first principal component is the normalized linear combination (i.e., the sum of squares of the coefficients being one) with maximum variance. In effect, transforming the original vector variable to the vector of principal components amounts to a rotation of coordinate axes to a new coordinate system that has inherent statistical properties. The principal components turn out to be the characteristic vectors of the covariance matrix. Thus the study of principal components can be considered as putting into statistical terms the usual developments of characteristic vectors. For the point of view of statistical theory, the set of principal components yields a convenient set of coordinates, and the accompanying variances of the components characterize their statistical properties [Anderson, 1974]. In statistical practice, this method is used to find the linear combinations with large variance. In many exploratory studies the number of variables under consideration is too large to handle. Since it is the deviations in this studies which are of interest, a way of reducing the number of variables to be treated consists in discarding the linear combinations which have small variances and study only those with large variances. PCA has proven to be an exceedingly popular technique for dimensionality reduction


and is discussed at length in most texts on multivariate analysis. Its many application areas include data compression [Kirby & Sirovich, 1990], image analysis, visualization, pattern recognition [Turk & Pentland, 1991] and time series prediction. In the following, we present the basic principles of principal component analysis and the mathematical foundations behind finding the principal components, which are the eigenvectors corresponding to the highest eigenvalues of the covariance matrix of the different variables.

3.2.2 Basic principles

Principal component analysis involves several steps in order to transform a set of correlated variables into a number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Like other transformation techniques, such as the Fourier transform, PCA transforms data into another representation where new variables constitute the basis of this new representation. This transformation differs from other transformations since the generated basis vectors are not constant but depend on the original data being transformed. Transformation techniques are used to enhance some aspects of the data sets, and PCA is also used for such an enhancement: while the Fourier transform deals with the frequency aspect, PCA considers the variance in the data set. PCA is a linear transformation and the new basis vectors are subject to an orthonormality constraint. If we denote by $e_i$ these new basis vectors, then

$$e_i^T e_j = \delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} \qquad (3.1)$$

Since PCA is a linear transformation with orthonormal basis vectors, it can be expressed as a translation followed by a rotation. If we consider the input data $x$ and the output data $y$ obtained after applying PCA, then the transformation may be expressed as follows:

$$y = B(x - \mu_x) \qquad (3.2)$$

where $B$ is the matrix of new basis vectors (i.e., $B = [e_1 \; e_2 \; \ldots \; e_n]^T$) and $\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample mean of the input data.

If we consider the two-dimensional case, then Figure 3.1 illustrates the basic principle of this transformation. Figure (a) presents the data set in its initial form; each sample $i$ is denoted $x_i = [x_{i1}, x_{i2}]^T$. In Figure (b), the $i$th sample is denoted $y_i = [y_{i1}, y_{i2}]^T$ and is calculated using Equation (3.2). The first two figures illustrate the transformation of an initial set of points into another representation where the main portion of the variance is carried by the first variable $Y_1$. This means that if we ignore the second variable, as in figure (c), the main variance of the data is kept. Most of the time, information corresponds to variance and vice versa; thus representing an initial data set in a new, more compact space that keeps much of the variance offers many facilities for interpreting the data in a reduced space. The illustrative example above uses a reduction from two dimensions to one, i.e., only one variable is ignored, which might not seem significant. However, in reality the


Figure 3.1: PCA basic principle.

reduction might be performed over hundreds or thousands of variables down to only 2 or 3 variables, as shown by the different experiments we conduct in the following paragraphs and in the next chapters. Therefore PCA is used not only for interpretation but also for space reduction when many variables are considered in the initial data set. The most used transformation for generating the new compact space with fewer dimension axes is that of Hotelling [Hotelling, 1933]: for a set of $N$ observed $d$-dimensional data vectors $v_i, i \in \{1,\ldots,N\}$, the $q$ principal axes $u_j, j \in \{1,\ldots,q\}$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $u_j$ are given by the $q$ dominant eigenvectors (i.e., those with the largest associated eigenvalues) of the sample covariance matrix

$$C = \frac{1}{N}\sum_{i} (v_i - \bar{v})(v_i - \bar{v})^T \qquad (3.3)$$

such that

$$C u_j = \lambda_j u_j \qquad (3.4)$$

where $\bar{v}$ is the sample mean and $\lambda_j$ the eigenvalue corresponding to the eigenvector $u_j$. The vector $w_i = U^T (v_i - \bar{v})$, where $U = [u_1 \; u_2 \; \ldots \; u_q]$, is thus a $q$-dimensional reduced representation of the observed vector $v_i$. Assume we are given $N$ observed data points in a $d$-dimensional space $\mathbb{R}^d$, as represented in Figure 3.2. To determine the first component, we begin by finding the first axis $X_1$, passing through the origin $O$, that best fits the cloud of observed data points; its direction is given by a basis vector $u$ satisfying Equation (3.1). The observations $v_i$ are then projected onto the new axis $X_1$. The usual criterion for fitting the considered points is the mean square method, so we should minimize the sum of the squared distances between the points $v_i$ and their corresponding projections $h_i$:

Figure 3.2: Projection on a new axis ($X_1$).

$$\sum_{i=1}^{N} \overline{v_i h_i}^{\,2} = \sum_{i=1}^{N} \overline{Ov_i}^{\,2} - \sum_{i=1}^{N} \overline{Oh_i}^{\,2} \qquad (3.5)$$

This is equivalent to maximizing the value of $\sum_{i=1}^{N} \overline{Oh_i}^{\,2}$, where

$$\sum_{i=1}^{N} \overline{Oh_i}^{\,2} = (Vu)^T (Vu) = u^T V^T V u \qquad (3.6)$$

$V$ (of size $N \times d$) is the matrix representing the observed points, where line $i$ corresponds to the data point $v_i$. To find the value of $u$, one should maximize the quadratic form $u^T V^T V u$ subject to the constraint $u^T u = 1$, as defined in Equation (3.1). Let $u_1$ be the vector that realizes this maximum (see Equation (3.7)).

$$\begin{cases} \text{maximize } u^T V^T V u \\ \text{s.t. } u^T u = 1 \end{cases} \qquad (3.7)$$

The Lagrangian associated with Equation (3.7) is then written in Equation (3.8), where $\lambda$ is the corresponding Lagrange multiplier [Arfken, 1985]:

$$L(u, \lambda) = u^T V^T V u - \lambda(u^T u - 1) \qquad (3.8)$$

The extremum condition satisfying Equation (3.8) is the solution to Equation (3.9):

$$\frac{\partial L(u, \lambda)}{\partial u} = 2 V^T V u - 2\lambda u = 0 \qquad (3.9)$$

Therefore, we get

$$(V^T V) u = \lambda u \qquad (3.10)$$


We notice that $u_1$ is an eigenvector of the matrix $V^T V$ corresponding to the eigenvalue $\lambda$. By multiplying both sides of Equation (3.10) on the left by $u^T$, we obtain Equation (3.11):

$$u^T V^T V u = \lambda \qquad (3.11)$$

Therefore, the maximum in question corresponds to an eigenvalue of the matrix $V^T V$, and $u_1$ is the eigenvector corresponding to the highest eigenvalue $\lambda_1$ of this symmetric matrix. The two-dimensional subspace that best fits the data cloud necessarily includes the subspace generated by $u_1$. We should then find a new vector $u_2$, orthogonal to $u_1$, as the second basis vector of the new subspace. By the same reasoning, we again have to solve an optimization problem, this time with two constraints, as described in Equation (3.12):

$$\begin{cases} \text{maximize } u^T V^T V u \\ \text{s.t. } u^T u = 1 \\ \phantom{\text{s.t. }} u^T u_1 = 0 \end{cases} \qquad (3.12)$$

The Lagrangian associated with the problem described in Equation (3.12) is shown in Equation (3.13), where $\lambda$ and $\mu$ are two Lagrange multipliers:

$$L(u, \lambda, \mu) = u^T V^T V u - \lambda(u^T u - 1) - \mu \, u^T u_1 \qquad (3.13)$$

Let $u_2$ be the solution that realizes this maximum. The extremum condition for $u_2$ is therefore

$$\frac{\partial L(u, \lambda, \mu)}{\partial u} = 2 V^T V u - 2\lambda u - \mu u_1 = 0 \qquad (3.14)$$

By multiplying Equation (3.14) on the left by $u_1^T$, we obtain $\mu = 0$. We then get from Equation (3.14)

$$(V^T V) u = \lambda u \qquad (3.15)$$

Therefore, the second maximum in question also corresponds to an eigenvalue of the matrix $V^T V$, and $u_2$ is the eigenvector corresponding to the second highest eigenvalue $\lambda_2$ of this symmetric matrix. The first axis $X_1$ is called the first factorial axis, and the second axis $X_2$, generated by the basis vector $u_2$, is called the second factorial axis. This result may be extended to a subspace of $\mathbb{R}^d$ of dimension $q < d$; this subspace is generated by the $q$ eigenvectors that correspond to the $q$ highest eigenvalues of the symmetric matrix $V^T V$. Since the new subspace is generated by the corresponding eigenvectors, the input data sets can be transformed. The dimension of the subspace is the same as that of the input data, so nothing is gained so far in terms of data reduction. However, in practice it is desired to reduce the subspace as much as possible. Generally, how much the dimensionality can be reduced is a matter of representing as much information as possible in as small a space as possible. In other words, determining how many eigenvectors to ignore is a tradeoff between the desired low dimension and the unwanted information loss. Since the $i$th eigenvalue is, by definition, equal to the variance of the $i$th new variable, and since $\lambda_i \geq \lambda_{i+1}$, this tradeoff may be quantified by the inertia

$$I_{q'} = \frac{\sum_{i=1}^{q'} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \times 100\% \qquad (3.16)$$

where $q'$ represents the number of axes considered in the new subspace and $d$ denotes the dimension of the input data. The quantity $\sum_{i=1}^{q'} \lambda_i$ is called the inertia explained


by the subspace generated by the first $q'$ eigenvectors corresponding to the $q'$ highest eigenvalues, $\sum_{i=1}^{d} \lambda_i$ is the total inertia (variance) of the initial input data, and $I_{q'}$ represents the percentage of information kept after the transformation, i.e., the inertia ratio explained by the new subspace.

In practice, the number of principal factorial components chosen depends on the precision we wish to reach. In general, we can restrict ourselves to the first 2, 3 or 4 principal factors. The projection of the input data set onto the first two factorial axes carries an important part of the information contained in the initial data set, generally between 80% and 90% [Jolliffe, 2002]; if this is not the case, we can continue projecting onto further axes. From the above analysis, principal component analysis (PCA) can thus be considered as a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components [Jolliffe, 2002]. The objective of principal component analysis is to reduce the dimensionality (number of variables) of the data set while retaining most of the original variability in the data. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. In the following sections, we first derive a computationally feasible formula to find the eigenprofiles (eigenvectors in mathematical terms), and then we describe the intrusion detection algorithm based on these eigenprofiles. The different experimental results using this algorithm are described in the last section.
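To make the above procedure concrete, the following sketch (an illustration under our own naming assumptions, not the exact implementation used in this thesis) computes the principal axes by diagonalizing the covariance matrix of Equation (3.3) and keeps the smallest number of axes whose inertia ratio, Equation (3.16), reaches a chosen percentage:

import numpy as np

def pca_reduce(V, target_inertia=0.90):
    # Center the N x d data matrix V.
    Phi = V - V.mean(axis=0)
    # Covariance matrix of Equation (3.3).
    C = Phi.T @ Phi / V.shape[0]
    # eigh is appropriate since C is symmetric; sort by decreasing variance.
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Inertia ratio of Equation (3.16); keep the smallest q' reaching the target.
    inertia = np.cumsum(eigvals) / eigvals.sum()
    q = int(np.searchsorted(inertia, target_inertia)) + 1
    return Phi @ eigvecs[:, :q], eigvecs[:, :q], inertia[q - 1]

For instance, with target_inertia=0.90, the call returns the projected data, the retained factorial axes and the actual percentage of variance they explain.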

3.3 The eigenprofiles approach

Much of the previous work on anomaly intrusion detection has ignored the issue of selecting the measures of the user profile and/or application behavior stimulus. This suggested to us that an information theoretic approach to coding and decoding user behaviors may give new information about those behaviors, emphasizing the most significant features. Such features may or may not be directly related to the metrics actually used, such as the CPU time consumed, the number of commands issued by the user during a login session, the number of each event type audited during an interval of time, etc. In the language of information theory, we want to extract the relevant information in a user profile, encode it as efficiently as possible, and compare one behavior encoding with a database of user behaviors encoded similarly. A simple approach to extracting the information contained in a profile is to somehow capture the variation in a collection of user behaviors and use this information to encode and compare user behavior profiles. In mathematical terms, we wish to find the principal components of the distribution of the behaviors, i.e., the eigenvectors of the covariance matrix of the set of user profiles, treating a behavior as a point (or vector) in a space whose dimension equals the number of different metrics used. The eigenvectors are ordered, each one accounting for a different amount of the variation among the user behaviors. These eigenvectors can be thought of as a set of features that together characterize the variation between user behaviors; we call each such eigenvector an eigenprofile. Each user profile can be represented exactly as a linear combination of the eigenprofiles. Each profile can also be approximated using only the best eigenprofiles, those that have the largest eigenvalues and which therefore account for the most variance


within the set of user profiles. The best N eigenprofiles span an N-dimensional subspace, the "profile space", of all possible profiles. In our first simulation experiments, a user's normal profile consists of a set of statistical measures, such as those proposed by Denning [Denning, 1987] and later enforced in the IDES tool (see Section 3.1.1), each accounting for some numerically quantifiable aspect of observed behavior. For example:

• the amount of CPU used,
• the number of failed logins during an interval of time,
• the number of password failures during a minute,
• the number of files opened during a certain period,
• the number of applications executed during a session,
• the number of different web pages visited in one day, etc.

On the other hand, to apply this method in a real network with real users, we installed Squid [Squid cache, 2005], a free high speed internet proxy-caching program, behind a firewall in front of 6 users' machines. We then audited these users during one month and performed our tests on the 42 Mbyte proxy web log file generated by Squid during this period. The experimental results are discussed in Section 3.5. The different measures are collected in an $n$-dimensional vector called a profile vector, representing the profile of the user during the login session or a chosen interval of time, where $n$ is the number of statistical measures mentioned above. So if $\Gamma$ is a profile that corresponds to the behavior of a certain user, then one can write:

$$\Gamma = \begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_n \end{pmatrix} \qquad (3.17)$$

where $m_i$, $i = 1, \ldots, n$, correspond to the measures characterizing a user profile.

3.4 Different steps of the method

The general architecture of the proposed technique is described in Figure 3.3. The eigenprofiles approach has two main steps. The first step consists in learning the different user profiles with respect to the different considered measures; a statistical knowledge database, representing the different normal profiles observed during this step, is then stored for further detection. The second step consists in observing a new profile and comparing it to the corresponding knowledge of the different profiles stored in the profiles database. We may summarize these two steps as follows. The initialization procedure is necessary and is made of the following steps:

Figure 3.3: The general architecture of the eigenprofile approach (an audited profile is projected onto the eigenprofiles space and compared against the profiles database, yielding either a known user exhibiting normal behavior or an intruder exhibiting abnormal behavior; a learning phase precedes the detection phase).

• collect an initial set of user profiles (regular users of the network, for example; this set is called the training set),

• calculate the eigenprofiles from the training set, keeping only the $M'$ eigenprofiles that correspond to the highest eigenvalues. These $M'$ eigenprofiles define the profile space. As new users appear or new functions are assigned to the usual users, the eigenprofiles can be updated or recalculated,

• calculate the corresponding distribution in the $M'$-dimensional weight space for each known user, by projecting his profiles onto the profile space,

• for each class (user reference), define the reference feature vector,

• define a threshold for each class, and

• store all these measures in a database for further investigation.

Having initialized the system, a classification process is necessary to detect whether a newly observed behavior is normal or not. The following steps are then used to detect intrusions from the audit trail whenever a new profile is introduced:

• calculate a set of weights based on the input profile and the $M'$ eigenprofiles by projecting the input profile onto each eigenprofile,

• determine whether the profile corresponds to one of the training set by checking whether it is sufficiently close to the profile space,

• if there is a profile in the training set close to the input one, verify whether the current user is the one he claims to be or a masquerader, using the procedure presented in Section 3.4.2,


• (optional) if the same new profile is encountered several times, verify whether it corresponds to an actual user of the network; if so, calculate its characteristic weight and incorporate it into the known behaviors. Verifying whether a new behavior really corresponds to a known user is a procedure external to this technique.

In the following paragraphs we derive and develop, from the different equations described in Section 3.2.2, our approach for defining the different profiles for anomaly intrusion detection based on principal component analysis.

3.4.1 Initialization procedure

Consider a system of $N$ users, where each user is audited $NU$ times during different time periods. The total number of audited profiles is then $M = N \times NU$. We consider the following profile types for our approach.

Average profile

Let the training set of profile vectors be $\Gamma_1, \Gamma_2, \ldots, \Gamma_M$. The average profile $\Psi$ of this set is defined as the center (centroid) of the different measures of the profiles in the training data set:

$$\Psi = \frac{1}{M} \sum_{i=1}^{M} \Gamma_i \qquad (3.18)$$

Each profile differs from the average by

$$\Phi_i = \Gamma_i - \Psi \qquad (3.19)$$

Let us call $\Phi_i$ a translated profile.

Calculating the eigenprofiles

The eigenprofiles are the eigenvectors of the covariance matrix $C$, where

$$C_{(n \times n)} = \frac{1}{M} \sum_{i=1}^{M} \Phi_i \Phi_i^T = A A^T \qquad (3.20)$$

and

$$A_{(n \times M)} = \frac{1}{\sqrt{M}} [\Phi_1 \; \Phi_2 \; \ldots \; \Phi_M] \qquad (3.21)$$

Let $U_k$ be the $k$th eigenvector of $C$, $\lambda_k$ the associated eigenvalue, and $U = [U_1 \; U_2 \; \ldots \; U_M]$ the matrix of these eigenvectors (eigenprofiles). Then, according to Equations (3.1) and (3.4), we may write

$$C U_k = \lambda_k U_k \qquad (3.22)$$

such that

$$U_k^T U_n = \begin{cases} 1 & \text{if } k = n \\ 0 & \text{if } k \neq n \end{cases} \qquad (3.23)$$


We define the feature vector $\Omega_i$ corresponding to the $i$th profile as

$$\Omega_i = U^T \times \Phi_i = \begin{pmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_M \end{pmatrix} \qquad (3.24)$$

The weights $\omega_i, i = 1, \ldots, M$, describe the contribution of each eigenprofile in representing the input profile, treating the eigenvectors as the basis set for user behaviors. This feature vector may then be used in a standard behavior classification algorithm to find which of a number of predefined behavior classes, if any, best describes the behavior. If the length of the profile vector is $n$ (the number of considered measures), the matrix $C$ is $n \times n$, and determining its $n$ eigenvalues and eigenvectors is an intractable task when hundreds of measures are considered. In the following, we develop a procedure that decreases the calculation time. A computationally feasible method is the following: from Equations (3.20) and (3.22), one can obtain

$$A A^T U_k = \lambda_k U_k \qquad (3.25)$$

$$A^T A (A^T U_k) = \lambda_k (A^T U_k) \qquad (3.26)$$

Let

$$Y_k = A^T U_k \qquad (3.27)$$

Then

$$A^T A \, Y_k = \lambda_k Y_k \qquad (3.28)$$

From Equation (3.28), $Y_k$ is an eigenvector of $A^T A$ and $\lambda_k$ is its corresponding eigenvalue. Let $X_k = \alpha_k Y_k$, so $X_k$ is also an eigenvector of $A^T A$. From Equation (3.27):

$$X_k^T X_k = (\alpha_k A^T U_k)^T (\alpha_k A^T U_k) = \alpha_k^2 \lambda_k U_k^T U_k \qquad (3.29)$$

We have $U_k^T U_k = 1$, so

$$X_k^T X_k = \alpha_k^2 \lambda_k \qquad (3.30)$$

In order to obtain a normalized vector $X_k$ (i.e., $X_k^T X_k = 1$), we must have

$$\alpha_k^2 \lambda_k = 1 \;\Rightarrow\; \alpha_k = \frac{1}{\sqrt{\lambda_k}} \qquad (3.31)$$

From Equations (3.25) and (3.27), we have

$$A Y_k = A A^T U_k \qquad (3.32)$$

$$A Y_k = \lambda_k U_k \qquad (3.33)$$

so that

$$U_k = \frac{1}{\lambda_k} A Y_k = \frac{1}{\alpha_k \lambda_k} A X_k = \frac{1}{\sqrt{\lambda_k}} A X_k \qquad (3.34)$$

Hence,

$$U_k = \frac{1}{\sqrt{\lambda_k}} A X_k \qquad (3.35)$$


With this analysis, the calculation is highly reduced, from the order of the number of measures $n$ to the order of the number of profiles in the training set $M$, because $X_k$ is an eigenvector of the matrix $A^T A$, whose dimension is only $M$. Note that most of the time, the number of measures considered for characterizing a behavior outnumbers the number of users considered for analysis, i.e., $n \gg M$. When this is not the case, the above analysis may not be useful, since it is then better to find the eigenvectors of the covariance matrix $C_{n \times n}$ directly. This situation occurs, for example, when the observations in the training data set are collected during a long period while each observation is expressed with a fixed number of attributes. In chapter 4, for instance, we consider a connection record as an observation whose measures are fixed a priori: the training data set consists of 494021 connection records, and the 41 discrete and numerical attributes characterizing each connection are transformed into 125 numerical variables.
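The shortcut derived above can be sketched as follows (a NumPy illustration assuming $n \gg M$; the function name is ours): it solves the small $M \times M$ eigenproblem of Equation (3.28) and maps the eigenvectors back to eigenprofiles with Equation (3.35).

import numpy as np

def eigenprofiles(Phi):
    # Phi: n x M matrix of translated profiles; A as in Equation (3.21).
    n, M = Phi.shape
    A = Phi / np.sqrt(M)
    # Diagonalize the small M x M matrix instead of the n x n covariance.
    lam, X = np.linalg.eigh(A.T @ A)           # Equation (3.28)
    keep = lam > 1e-12                          # drop numerically null eigenvalues
    lam, X = lam[keep], X[:, keep]
    order = np.argsort(lam)[::-1]
    lam, X = lam[order], X[:, order]
    # Map each X_k back to the eigenprofile U_k, Equation (3.35).
    U = A @ X / np.sqrt(lam)
    return U, lam

The returned columns of U are already normalized, as guaranteed by the choice of alpha_k in Equation (3.31).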

Calculating the feature vectors in the new space

Each profile is represented by a feature vector in the eigenprofiles space. The feature vector $\Omega_i$ of a profile $\Gamma_i$ is obtained by projecting its translated profile $\Phi_i$ onto the eigenprofiles space, as described in Equation (3.24), where $U = [U_1 \; U_2 \; \ldots \; U_{M'}]$. Hence, each profile is represented by a set of $M'$ coordinates, where $M' < M$. $M'$ is chosen according to Equation (3.16) such that the new subspace is reduced while keeping most of the information. The weights describe the contribution of each eigenprofile in representing the input profile, treating the eigenvectors as the basis set for user behaviors. This feature vector may then be used in a standard behavior classification algorithm to find which of a number of predefined behavior classes, if any, best describes the behavior. The simplest method for determining which profile class provides the best description of an input profile vector is to find the class $k$ that minimizes the Euclidean distance to the input profile, as described in the following paragraph.

Class organization

To each user corresponds a class composed of the feature vectors obtained by projecting his $NU$ profiles onto the eigenprofiles space. Each class $k$ is represented by a reference feature vector $\overline{\Omega}^k$. The $N$ profile classes $\overline{\Omega}^k, k = 1, \ldots, N$, are calculated by averaging, for each user, the eigenprofile representations of the audited profiles (as few as one) that best represent the user behavior during a session or a pertinent interval of time, i.e.,

$$\overline{\Omega}^k = \frac{1}{NU} \sum_{i=1}^{NU} \Omega_i^k \qquad (3.36)$$

where $\Omega_i^k$ is the $i$th feature vector of the $k$th user class. Let us call $\overline{\Omega}^k$ the mean feature vector of user $k$ in the new feature space. A user behavior $\Omega_i$ is classified as belonging to class $k$ when the minimum

$$\epsilon_k = \|\Omega_i - \overline{\Omega}^k\|_2 \qquad (3.37)$$


is less than some chosen threshold $\theta$ (where $\|\Omega_i - \overline{\Omega}^k\|_2$ represents the Euclidean distance between the mean feature vector $\overline{\Omega}^k$ of class $k$ and the feature vector $\Omega_i$ of the input profile). Otherwise, the new behavior is considered anomalous.

3.4.2 The detection and identification procedure

We summarize the eigenprofile identification and detection process defined in our approach in the following steps:

1. audit a new user profile during a session (or an interval of time),

2. determine its corresponding translated profile $\Phi_i = \Gamma_i - \Psi$,

3. project the translated profile $\Phi_i$ onto the eigenprofile space according to Equation (3.24), obtaining its corresponding feature vector

$$\Omega_i = U^T \times \Phi_i = \begin{pmatrix} \omega_1 \\ \omega_2 \\ \vdots \\ \omega_{M'} \end{pmatrix}$$

where $U$ is the eigenprofile matrix,

4. determine the $k$th class which minimizes the distance $\epsilon_k = \|\Omega_i - \overline{\Omega}^k\|_2$:

IF $\epsilon_k \leq \theta_k$ THEN the new observed profile is that of the $k$th user
ELSE the new observed profile is anomalous
ENDIF

The threshold $\theta_k$ is defined by experience. The simplest method consists in setting its value to the distance between the farthest feature vector of the user and his mean feature vector described in Equation (3.36):

$$\theta_k = \max_{i=1,\ldots,NU} \left( \|\Omega_i^k - \overline{\Omega}^k\|_2 \right) \qquad (3.38)$$

If this threshold is set too high, the number of false negatives increases; if it is set too low, the number of false positives increases. Setting this parameter is thus a tradeoff between the high detection rate and the low false alarm rate desired by the site security officer, according to the security policy of the monitored system.
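A minimal sketch of this decision rule follows (illustrative only; U, psi, class_means and thresholds are assumed to come from the initialization procedure of Section 3.4.1):

import numpy as np

def classify_profile(gamma, U, psi, class_means, thresholds):
    # Feature vector of the translated profile, Equation (3.24).
    omega = U.T @ (gamma - psi)
    # Find the class k minimizing the distance of Equation (3.37).
    best_user, best_dist = None, np.inf
    for user, mean_vec in class_means.items():
        d = np.linalg.norm(omega - mean_vec)
        if d < best_dist:
            best_user, best_dist = user, d
    # Threshold test of Section 3.4.2, with theta_k from Equation (3.38).
    if best_dist <= thresholds[best_user]:
        return best_user
    return "anomalous"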

3.4.3 Summary of the eigenprofiles approach

To summarize, the eigenprofiles approach to intrusion detection involves the following steps:

1. collect a set of characteristic profile vectors of the known users. This set should include, for each user, a number of profiles audited on different days over a given interval of time, for example during the login session. These profiles should be as significant as possible and as close as possible to the actual user behavior, so as to characterize it well (say, for example, four profiles for twenty users, so $M = 80$),


2. calculate the $80 \times 80$ matrix $A^T A$, find its eigenvectors and eigenvalues, and choose the eigenvectors with the highest associated eigenvalues (let $M' = 50$ for this example),

3. combine the normalized training set of the different profiles according to Equation (3.35) to produce the $M'$ eigenprofiles $U_k$,

4. for each known user, calculate the class vector $\overline{\Omega}^k$ by averaging the feature vectors calculated from the original (four) profiles of the user, and choose a threshold that defines the maximum allowable distance from any profile class,

5. for each new profile to classify, apply the classification algorithm presented in Section 3.4.2,

6. if a new profile is classified as a known user, it may be added to the original set of familiar profile vectors (steps 1-4). This gives the opportunity to update the profile space as the detection system encounters more instances of known users. In our current system, the calculation of the eigenprofiles is done offline as part of the training.

3.5 Experimental results

In the following, we show the robustness and the simplicity of the proposed approach for behavior classification. To illustrate the different steps of the eigenprofile approach, we considered in the first experiment a very simple example. It may be seen as a small part of our system because it only considers the occurrences of some events and commands. It involves four kinds of users, as defined in [Mé, 1994]: the inexperienced user, the novice developer, the professional developer and the UNIX intensive user. The measures used during this preliminary experiment are the occurrences of the events generated by the commands issued by a user during an interval of time (30 minutes). In the following paragraph, we apply the different steps of the eigenprofile technique to these users' behaviors and show that the eigenprofile technique characterizes the behavior of each user well, and that this new method is very promising for anomaly intrusion detection.

3.5.1 A simple data set

The different users' profiles, obtained by translating the generated commands into AIX audit events during the simulated session (for more details, see for instance [Mé, 1994]), are described in Table 3.1.

1. The different profiles generated from these characterized behaviors (the training set) are those provided in Table 3.1. The measures are the occurrence counts of the following audit events for the inexperienced user (Γ1), the novice developer (Γ2), the professional developer (Γ3) and the Unix intensive user (Γ4): user login fail; user log (23h to 6h); short session; use su Ok; use su fail; who, w, finger...; more, pg, cat...; ls OK; ls fail; df, hostname, uname; arp, netstat, ping; ypcat; lpr; rm, mv; ln; whoami, id; rexec, rlogin, rsh; proc Execute; Proc SetPetri; file open fail; file open fail cp; file open .netrc; file read lpr; file read passwd; file write passwd fail; file write cp ok; file unlink rm; file mode. For instance, the proc Execute event occurs 16, 18, 55 and 17 times for Γ1, Γ2, Γ3 and Γ4, respectively, and the lpr event occurs 4, 1, 1 and 1 times.

Table 3.1: The different profiles generated by the different users.

2. the average profile defined by these profiles, calculated according to Equation (3.18), is

$$\Psi^T = \frac{1}{4}\sum_{i=1}^{4} \Gamma_i^T = (0\ 0\ 0\ 0\ 0\ 2.25\ 1.75\ 1.75\ 0\ 0\ 0\ 0\ 1.75\ 0\ 0\ 0.75\ 26.5\ 0\ 2.5\ 0\ 0\ 0\ 0\ 0\ 0.5\ 1\ 0.25)$$

This profile corresponds to the mean of the different profiles presented in Table 3.1,

3. the difference of each profile from the average profile (according to Equation (3.19)) is then calculated for each user ($\Phi_1, \Phi_2, \Phi_3, \Phi_4$),

4. the matrix $A_{29 \times 4}$ is calculated according to Equation (3.21),

5. the eigenvectors of the covariance matrix $C_{29 \times 29} = \frac{1}{M}\sum_{i=1}^{M} \Phi_i \Phi_i^T = A A^T$ given by Equation (3.20) are calculated using the result of Equation (3.35), through the small matrix

$$A^T A = \begin{pmatrix} 37.15 & 20.03 & -77.40 & 20.21 \\ 20.03 & 19.65 & -60.28 & 20.59 \\ -77.40 & -60.28 & 204.78 & -67.09 \\ 20.21 & 20.59 & -67.09 & 26.28 \end{pmatrix}$$

The eigenvalues $\lambda_i, i = 1, \ldots, 4$, of this matrix are

$$\lambda = \begin{pmatrix} 273.792 \\ 12.249 \\ 1.833 \\ 0 \end{pmatrix}$$

Their corresponding orthonormal eigenvectors are

$$X_1(273.792) = \begin{pmatrix} -0.329 \\ -0.254 \\ 0.865 \\ -0.282 \end{pmatrix}, \quad X_2(12.249) = \begin{pmatrix} -0.786 \\ 0.268 \\ -0.038 \\ 0.556 \end{pmatrix}, \quad \text{and} \quad X_3(1.833) = \begin{pmatrix} 0.157 \\ -0.783 \\ 0.026 \\ 0.601 \end{pmatrix}$$

The eigenvectors $U_k, k = 1, \ldots, 3$, of the covariance matrix $C = \frac{1}{M}\sum_{i=1}^{M} \Phi_i \Phi_i^T = A A^T$ are obtained from Equation (3.35). In this example, we have considered only one profile per user, so each profile corresponds exactly to its class vector $\overline{\Omega}^k, k = 1, \ldots, 4$ (see step 4 of the summary of the eigenprofiles procedure). Table 3.2 presents the Euclidean distances (according to Equation (3.37)) between the four considered users, taken pairwise, in the new eigenprofiles space.

        User1    User2    User3    User4
User1   0        4.095    19.927   4.797
User2            0        18.577   2.179
User3                     0        19.111
User4                              0

Table 3.2: Euclidean distances between the four classes, taken pairwise.
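These figures can be checked numerically in a few lines (a verification sketch over the matrix A^T A printed above; NumPy assumed):

import numpy as np

AtA = np.array([[ 37.15,  20.03, -77.40,  20.21],
                [ 20.03,  19.65, -60.28,  20.59],
                [-77.40, -60.28, 204.78, -67.09],
                [ 20.21,  20.59, -67.09,  26.28]])

lam = np.linalg.eigvalsh(AtA)[::-1]   # eigenvalues, decreasing order
print(lam)                            # approx. 273.79, 12.25, 1.83, 0
print(lam[0] / lam.sum())             # inertia of the first axis, approx. 0.95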

If we represent the four profiles in the new eigenprofiles space, the representation looks like Figure 3.4, where $U_1$, $U_2$ and $U_3$ are orthonormal axes. According to Equation (3.16), the inertia ratio generated by the first eigenvalue $\lambda_1 = 273.792$ is

$$I_1 = \frac{273.792}{273.792 + 12.249 + 1.833} \times 100\% = 95.11\%$$

This is why the first axis $U_1$ is a discriminating axis: it corresponds to the highest eigenvalue $\lambda_1 = 273.792$. Using only this axis, as shown in Figure 3.4, we may differentiate between all the user profiles. The previous paragraph describes a small application of this new method on a simple example, which shows that the eigenprofile intrusion detection method classifies the behaviors of the different users well and assigns a unique class to each user profile. The reader may simulate some behaviors according to the different events and then apply the classification by PCA to these new audited behaviors; he will certainly find that this method is efficient. If the simulated behaviors do not differ much from those of the training set, then the Euclidean distance calculated using Equation (3.37) is small. Otherwise, these behaviors will be considered anomalous, i.e., the Euclidean distance from the different known users is large. A threshold $\theta$ shall be chosen experimentally. We have not defined


Figure 3.4: The projection of the users' profiles onto the new eigenprofiles space.

this threshold in this simple example, because we do not have many profiles for each user. The threshold is defined by experience; for example, it may be chosen equal to the distance between the farthest user profile of the training set and its class average, plus a small value $\epsilon$.

3.5.2 A real data set experimentation

This section presents the test bed we set up to perform realistic experiments, applying PCA to a real proxy web log file generated by Squid. Many systems (see for instance [Zaïane et al., 1998]) have been designed to extract implicit knowledge from web log files, but none of them, to our knowledge, has answered the following questions: do user behaviors change over time, and is it possible to characterize a user behavior using web log files, and how? By characterizing the different users' behaviors, we can distinguish network users and detect whether a new user is the one he claims to be. Principal component analysis, as presented in the previous sections, allows a good classification between subjects. For this reason, we use it to classify the known network users' behaviors and then, on the one hand, test its ability to reject intruders and detect masqueraders [Anderson, 1980] who might have bypassed the access control procedure, and on the other hand, measure its capacity to generalize to new behaviors of the authorized users. Each entry in the access log file has the following format (for more details, see [Squid cache, 2005]):

[Timestamp] [ElapsedTime] [Client-IP] [Action]/[Code] [Size] [Method] [URI] [Hierarchy]/[Content Type]

where

1. Timestamp: the time when the request is completed, specified in seconds since January 1, 1970 with millisecond resolution,

2. Elapsed Time: the elapsed time of the request in milliseconds (i.e., the time between the accept() and close() of the proxy socket),


3. Client-IP: the IP address of the connecting client,

4. Action: how the request was treated locally at the proxy server (e.g., hit, miss),

5. Code: the status code in the HTTP reply sent to the client in response to its request,

6. Size: the number of bytes transferred from the proxy to the client for a particular document retrieval request,

7. Method: the HTTP method requested by the client,

8. URI: the requested URI,

9. Hierarchy: how and where the requested document was fetched,

10. Content: the type of the object (from the HTTP reply header).

If any of the above fields is not filled in, it is replaced by a dash ("-"). Example:

985796546.734 1 192.168.xx.xx TCP_MISS/200 207 GET http://good.sightpath.com/images/smb-job-npm.gif - NONE/- image/gif

Before performing the detection phase, the system should construct the information set required to complete this task. This is what we call the learning phase. Its role consists in fixing the different parameters so as to provide the best results. It is an important and delicate task in our system: important because the system performance depends on these parameters, and delicate because there is neither a formula nor a direct method to determine their values; we have to calculate them by experimentation. To use PCA, we should first calculate the covariance matrix C (see Equation (3.20)). In our implementation, each entry in the profile vector counts the number of requests to a given URL during an audit session. A session consists of one day of the users' activity period, from 08h00 to 19h00. Thus, the length of a profile vector depends on the number of URLs visited by the different users of the training set.
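As an illustration of how such profile vectors can be built, the following sketch (field positions follow the format above; the function name is ours) counts, for each client IP, the number of requests to each URL found during an access.log session:

from collections import Counter, defaultdict

def build_profiles(log_lines):
    # One Counter of URL request counts per client IP (one audit session).
    profiles = defaultdict(Counter)
    for line in log_lines:
        fields = line.split()
        if len(fields) < 7:
            continue                 # skip malformed entries
        client_ip, uri = fields[2], fields[6]
        profiles[client_ip][uri] += 1
    return profiles

Turning these counters into the fixed-length profile vectors of Equation (3.17) then only requires enumerating the union of URLs visited by the training users.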


In our experimentation, the training set, which we call the familiar set, is composed of three clients; the other three clients form the unfamiliar set. We used 96 profiles audited during one month, i.e., 16 profiles for each user. The remaining profiles were not selected because they correspond to inactivity days, such as weekends and days when the users were absent. At the beginning, we divided this set of profiles into two classes. The first class, used as the training set, is composed of 24 profiles of the first three users, representing their behavior during the first two weeks, that is, 8 profiles per client. The 8 remaining profiles of each of the first three clients are used to test the generalization ability of the system (i.e., the unlearned profiles of the familiar set). Another subset of 48 unfamiliar profiles is used to evaluate the ability of the system to reject unknown clients. We considered the following ratios:

• Successful Identification Ratio (SIR): the percentage of profiles which are successfully identified.
• Confusion Ratio (CR), or false negatives: the percentage of profiles which are neither successfully identified nor rejected.
• Successful Reject Ratio (SRR): the percentage of unfamiliar profiles which are rejected.
• Failure Reject Ratio (FRR), or false positives: the percentage of familiar profiles which are rejected.

Table 3.3 shows the different results obtained during this first experimentation:

        Familiar set
        learned profiles    unlearned profiles
SIR     17/24 (70.83%)      10/24 (41.67%)
FRR     7/24 (29.17%)       14/24 (58.33%)

Table 3.3: Performance of the system in the first experiments.

These first results seem very poor: 29.17% of the learned subset were recognized as intruders, so it is not even interesting to test the unfamiliar subset. In reality, these false positives are due to the clients' behavior changing over time. To overcome this disadvantage, we introduced another strategy, which consists in auditing the users over a period that we chose equal to 4 successive days; the profiles of the following day are then presented for detection. We used the same first three users for the training set. Progressively, the training set is updated by considering only the past four days, provided the user's profile during this period is recognized as normal. Let us call this strategy "regular learning". This idea recalls Denning's statistical approach [Denning, 1987]: "Given a metric for a random variable x and n observations x1, x2, ..., xn, the purpose of a statistical model of x is to determine whether a new observation xn+1 is abnormal with respect to the previous observations". However, in our case, each observation represents a user profile audited during a certain period (one day in our example). Table 3.4 shows the different results obtained using this second strategy.

        Familiar set                              Unfamiliar set
        learned profiles    unlearned profiles
SIR     46/46 (100%)        34/36 (94.44%)        –
CR      0 (0%)              0 (0%)                0 (0%)
SRR     –                   –                     48/48 (100%)
FRR     0 (0%)              2/36 (5.56%)          –

Table 3.4: Performance of the system in the second strategy.

We considered 48 familiar profiles, of which only 46 are used during the learning step; two of them are excluded because they were recognized as anomalous behavior. This follows from our strategy of taking into account only the last four correctly predicted profiles for the next training step. Using regular training, the proposed method was successful and efficient enough to differentiate and classify new profiles: 34 of the 36 new familiar profiles were successfully identified, and all 48 unfamiliar profiles, corresponding to the 3 foreign users, were successfully rejected.
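A possible rendering of this regular learning loop is sketched below (illustrative; the train and classify callables stand for the initialization and detection procedures of Section 3.4, and the four-day window follows the experiment above):

def regular_learning(daily_profiles, train, classify, window=4):
    # daily_profiles: chronological list of (user, profile) pairs.
    windows = {}                      # user -> last accepted profiles
    decisions = []
    for user, profile in daily_profiles:
        history = windows.setdefault(user, [])
        if len(history) < window:     # bootstrap on the first days
            history.append(profile)
            continue
        model = train(history)        # eigenprofiles on the past 4 days
        label = classify(model, profile)
        decisions.append((user, label))
        if label == user:             # normal: slide the window forward
            history.pop(0)
            history.append(profile)
    return decisions

A profile recognized as anomalous is simply reported and never enters the window, which is why two of the 48 familiar profiles above were excluded from learning.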

3.6 Summary

In this chapter, we first presented the IDES and Hyperview intrusion detection tools. The IDES tool pioneered the field of intrusion detection for a decade; it is essentially an implementation of the statistical approach proposed by Denning [Denning, 1987]. We noticed that, along with the Hyperview tool, it presents some limitations, particularly in detecting masqueraders. We then investigated a flexible technique based on principal component


analysis that makes it possible to learn and classify users based on the different measures considered pertinent for describing user profiles in a monitored system. Principal component analysis is very important in system theory, and the idea of using it for intrusion detection is new. The eigenprofiles approach constitutes a novel method for anomaly intrusion detection that achieved, in our preliminary experiments, a good classification of the different Unix users and demonstrated its ability to learn and generalize users' profiles using web log files. It provides a simple engineering solution to the problem using principal component analysis, and the results obtained on the different data sets are very interesting. In addition to these advantages, it can easily detect masqueraders. Moreover, if a user changes his workstation, his feature vector will still be close to his true class; in this case, it is sufficient to check the IP address associated with this class to identify the corresponding user, and the user is thus detected as a masquerader. The different techniques described in this chapter are interesting, but their focus on users as analysis targets limits their applicability to network environments. This is why, in the following chapters, we investigate the application of principal component analysis combined with supervised techniques in order to detect network intrusions.

Chapter 4

Eigenconnections and supervised techniques to intrusion detection

In this chapter, we first give an overview of the data mining problem and show its close relation with intrusion detection. We then situate our work within the different steps that compose a knowledge discovery process. The different experiments we conducted are mainly based on classification techniques applied to intrusion detection. The first results are based on two supervised techniques, namely the nearest neighbor and decision trees. They are reported in [Bouzida & Gombault, 2004] and show the effectiveness of combining these methods with principal component analysis. In chapter 3, we presented an anomaly intrusion detection approach based on principal component analysis, whereas in the following we use PCA as a reduction technique for misuse detection. The different experiments are performed on connections reassembled from tcpdump network traffic. We show the effectiveness of applying the data reduction technique, which we call eigenconnections, to the connection records before applying a classification model for misuse detection. We assess the different results we obtain by comparing them to other techniques, since we performed our experiments on the MIT DARPA98 data sets. Principal component analysis enables us to significantly reduce the information quantity in the different data sets without loss of information. This technique outperforms previous work in the time consumed for classification while generating highly successful classification rates.

4.1 Data mining and knowledge data discovery

The increasing interest in data mining techniques from many sectors, ranging from industry and the military to scientific academia, is due to the wide availability of huge amounts of data and the necessity of turning this data into useful information and exploitable knowledge. Data collection is an easy task: many huge databases are collected each day by business companies, markets, etc. While data collection is cheap and easy, analyzing huge amounts of data remains expensive and difficult. Extracting useful information from high volumes of data manually is impractical, while at the same time fast techniques are needed to promptly investigate the recently collected data for further analysis. This situation has led many researchers to develop new analysis techniques for extracting knowledge from these large amounts of data. These techniques belong to the field of data mining, which is generally treated by many people as knowledge discovery in databases [Han & Kamber, 2001]; others, however, view data mining as simply an essential step in the process of knowledge discovery in databases.


Fayyad et al. [Fayyad et al., 1996a] defined the knowledge discovery process as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data, and data mining as a particular step in this process where specific algorithms are used for extracting patterns and models from data [Fayyad et al., 1996b]. The intrusion detection field shares the same characteristics as fields ranging from marketing to the military and industry, where huge amounts of data are collected and provided every day. In intrusion detection, huge volumes of network traffic are produced every hour, and their manual analysis remains impractical. In the following, we first present the different steps of a knowledge discovery process. Then, we present our scheme for inserting an information theoretic technique, used for space reduction, as an intermediate step among the different steps composing the knowledge discovery process.

4.1.1 Different steps for the knowledge discovery process

Knowledge discovery as a process is depicted in Figure 4.1 and consists of an iterative sequence of the following steps [Han & Kamber, 2001, Fayyad et al., 1996b]:


Figure 4.1: Data mining as a step in the process of knowledge discovery.

• Application domain knowledge: this step concerns developing an understanding of the application domain and the relevant prior knowledge, and identifying the goal of the knowledge discovery process for the analyzed problem.

• Creating a target data set: this consists in selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

• Data cleaning and preprocessing: the basic operations of this step include removing noise if appropriate, collecting the necessary information to model or account


for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

• Data transformation: this includes finding useful features to represent the data, depending on the goal of the task, and using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations of the data.

• Data mining: this corresponds to the essential process where methods are applied in order to extract data patterns; first by deciding on the model to use, such as summarization, regression or clustering, and then by applying an appropriate algorithm to generate, for example, a set of classification trees or rules and frequent sequential patterns.

• Interpretation or knowledge presentation: this includes interpreting the discovered patterns and possibly returning to any of the previous steps, as well as possibly visualizing the extracted patterns, removing redundant or irrelevant patterns, and translating the useful ones into terms understandable by users.

• Using discovered knowledge: this includes incorporating the knowledge into the performance system, taking actions based on it, or simply documenting it and reporting it to interested parties, as well as checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

Data mining is an essential step in the knowledge discovery process. It is considered the most critical one [Han & Kamber, 2001], since most previous work has focused on data mining algorithms; the other steps are, however, equally if not more important for the successful application of knowledge discovery in practice. The data mining step is necessary in almost all problems where valuable patterns must be discovered from huge databases. This step is also useful in intrusion detection for automatically discovering known intrusions from huge data sets, collected in particular from the monitored network under analysis. In the following section, we discuss the steps of the knowledge discovery process that are relevant to intrusion detection. We give our vision of intrusion detection with respect to how a reduction technique can be used before applying any classification technique. The reduction technique is effective if the information loss in the reduced data is negligible. It is also valuable when the number of measures considered in the intrusion detection process is highly representative.

4.1.2 Data mining algorithms for intrusion detection

Since data mining has seen rapid development during the last decades, numerous techniques have been introduced and others improved, bringing benefits to many fields such as pattern and speech recognition, machine learning, databases, etc. These techniques may be relevant to intrusion detection. Some of the model functions in data mining that are both pertinent and well suited to intrusion detection are:

Classification: maps data into one of several predefined categories. These techniques try to build models that can classify items into one of the known categories to which they may belong. These models may be expressed in the form of decision trees, rules, etc. An example of its application in intrusion detection is to gather some normal and various abnormal activities for a user, application or network traffic, then use a classification algorithm to build a classifier that may label new unseen activities as belonging to the normal class, to one of the known abnormal activities, or as a new unseen activity.


Clustering: unlike classification, where the classes are predefined, clustering techniques map a data item into one of several categorical classes (or clusters) that must be determined from the unlabeled training data set. Clusters are defined by finding natural groupings of data items based on similarity metrics or probability density models. An application of this model to intrusion detection is useful when the data is not labeled a priori, i.e., when we do not have any knowledge about the categories (normal or abnormal) of the different data sets that are subject to intrusion detection evaluation. Another hypothesis for this mapping to be successful is that the analyzed data sets should contain many more normal items than abnormal ones (for further details, see Section 5.1.2).

Link analysis: determines relations between fields in the database. The main goal of link analysis is to derive attribute correlations or interesting relationships among items in a given data set. While this technique is used to describe which items are commonly purchased together in grocery stores, in intrusion detection, for example, certain privileged programs only access certain system files in specific directories; these consistent behaviors should be included in normal usage profiles.

Sequence analysis: models sequential patterns. The goal is to model the states of the process generating the sequence, or to extract and report deviations and trends over time. This analysis may find which time-based sequences of audit events frequently occur. The frequent events may provide guidelines for incorporating temporal and statistical measures into intrusion detection models. For example, network probing attacks may suggest that many measures about hosts and services should be taken into account when building pertinent intrusion detection data sets.

Because of the large amount of data, both in the number of records and in the number of system or network features, data reduction, as a step in the knowledge discovery process, is also relevant to intrusion detection. With the above knowledge about the different steps of the knowledge data discovery process and the models and techniques relevant to intrusion detection, we can refine the definition of the data mining process investigated in [Lee et al., 1999] as follows. The data mining algorithm for intrusion detection corresponds to: (1) a process of extracting valuable features from user, system or network activities, (2) followed by a reduction and filtering technique that optimizes the space dimensionality of the different features extracted during the first step, and (3) finally, the application of a classification model to distinguish between what is normal and what is abnormal. This process is depicted in Figure 4.2, where the reduction step may or may not be present in the KDD process for intrusion detection. The raw data consists of the network traffic (packets) first captured from a network. This data is then summarized into connection records whose features and attributes are extracted from the captured binary network packets. These features correspond to the different characteristics of the packets composing each connection, such as duration, protocol, service, flag, etc. Other features are also calculated from the within-connection features. To do so, link analysis, as a model function in the data mining process, is used to construct new additional features calculated from the frequent patterns.
A reduction technique is then applied; it enables fast classification analysis, particularly when dealing with high speed networks, where accurate and real time countermeasures are needed to stop the devastating results of attacks. Classification models are then used to automatically learn the detection models for further detection. This process should be iterative, since bad classification may be the result of a bad classifier or of a bad feature extraction technique that, for example, does not extract all the features relevant to intrusion detection.


Figure 4.2: Knowledge data discovery process for intrusion detection.

The reduction step is a challenging issue that we propose for building effective and fast intrusion detection models. In fact, this technique is very useful, particularly for the new generation of high speed networks where new protocols are emerging. These new protocols add many new features that should be considered by intrusion detection tools, and the reduction technique brings an effective solution for such new protocols. Our main research, in the following, focuses on building alternative classification models for intrusion detection. For an effective and fast detection phase, we introduce a reduction technique based on principal component analysis before applying one of the classification models. Since the first steps of the knowledge data discovery process, extracting relevant features from raw tcpdump traffic, were performed in [Lee, 1999] and the resulting data sets are widely available, we experiment with different classification algorithms for intrusion detection. To do so, we formalize, in Section 4.2, the classification function as a final step in the knowledge discovery process for intrusion detection. Finally, we give the results of the different classification algorithms we evaluated over the KDD 99 data sets. For each evaluation experiment, we first apply a reduction technique over the KDD 99 data sets, as depicted in Figure 4.2, and then apply the corresponding classification model for intrusion detection. The corresponding algorithm is also applied directly over the KDD 99 data sets, and a comparison is made between the two results obtained, respectively, when using the reduction technique (based, in our case, on principal component analysis) and when applying the classification algorithm directly. An overview of the reduction technique based on PCA, which we call eigenconnections, is presented in Section 4.4. Section 4.5 gives a thorough description of the different KDD 99 intrusion detection cup data sets [Hettich & Bay, 1999] and the different steps taken to process tcpdump data into connection records, which is done by the different tools of the MADAM ID intrusion detection tool [Lee & Stolfo, 2000], whose main goals are to mine patterns and construct features. The different classification models we applied, with their corresponding results, are presented in Sections 4.3 and 4.6.


In the following section, we first give a definition of a classification model for misuse intrusion detection. We then present the different algorithms discussed below, combined with the reduction technique we call eigenconnections [Bouzida & Gombault, 2004], presented in Section 4.4.

4.2

Supervised Classification models for intrusion detection

Humans detect and identify faces, objects and any other shapes in a scene with little or no effort. They may easily detect intruders in an organization. However, developing a computational model for these tasks is difficult. Therefore, many automated applications have been developed for this kind of identification and classification, particularly in the field of pattern recognition. These classification tasks mainly build a classifier, i.e. a function that assigns a class to each item corresponding to the target object to be identified. More formally, let us consider a set of attributes A = [A_1, A_2, ..., A_n] where each attribute A_i, i = 1, ..., n, is either discrete or continuous. We also consider C = {c_1, c_2, ..., c_m} the set of the different class labels and I the universe of all possible values of A in the considered domain. Each data item x in I may be written as a vector x = [v_1, v_2, ..., v_n] where each v_i corresponds to a value of A_i. Since there is a class label for every sample x in I, there exists a function F that assigns a class label c ∈ C to each item x ∈ I. This function maps I to C such that F(x) = c.

4.2.1

Classifier Building task

In order to build a general classifier, a training data set should be available in which each instance has its corresponding class. If we consider T a set of items with predefined class labels, then each item t_i in T is described as a pair [x_i, c_i], where x_i ∈ I and c_i is the corresponding known class of x_i. The goal of an induction algorithm is to construct a new function F̂. This function should do more than memorize the training data: a simple scan of the training data set, testing each data item, would produce the simplest classification function, but F̂ should successfully generalize beyond the observed data, that is, classify new items that were not seen during training. The effectiveness of a classification model therefore corresponds both to its classification accuracy on the training data set used to build the classifier and to its generalization accuracy over new unseen data sets.
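The contrast between a memorizing function and an induced F̂ can be made explicit with a small sketch: a plain lookup over T is perfectly accurate on the training set but useless on unseen items, which is why the induced function must be evaluated on both training and test data. This is an illustrative fragment, not tied to any particular library.

```python
def memorizing_classifier(training_items):
    """The 'simplest' classification function: a lookup table built by a
    plain scan of the training set T (pairs [x_i, c_i]).  It is perfect on T
    but undefined on any item not seen during training, which is why an
    induced F_hat must generalize beyond the observed data."""
    table = {tuple(x): c for x, c in training_items}
    return lambda x: table.get(tuple(x))  # returns None for unseen items

def accuracy(classifier, items):
    """Fraction of items whose predicted class matches the known class."""
    return sum(classifier(x) == c for x, c in items) / len(items)
```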

4.2.2

Machine learning approaches

Machine learning algorithms have proven to be of great practical value in a variety of application domains. They are especially useful in (a) data mining problems where large databases may contain valuable implicit regularities that can be discovered automatically; (b) poorly understood fields where humans might not have the knowledge needed to develop effective algorithms (e.g., human face recognition); and (c) domains where the program must dynamically adapt to changing conditions (e.g., controlling manufacturing processes under changing supply stocks or adapting to the changing reading interests of individuals).


Machine learning draws its ideas from numerous disciplines, including artificial intelligence, statistics, computational complexity, information theory, etc. Learning involves searching a space of possible hypotheses to find the hypothesis that best fits the available training examples and other prior constraints or knowledge. While many techniques have been developed in the field of machine learning, we focus our research for intrusion detection on two of them. The first is the nearest neighbor algorithm and the second is the decision tree induction algorithm. Despite its time complexity, we used the nearest neighbor as a first algorithm for our experiments because one of our main goals is to test whether the combination of a reduction technique with a machine learning technique affects the detection results. That is, we perform a pairwise comparison to check whether all elements preserve the same characteristics as in their original representation.

4.3

Nearest neighbor and decision trees for intrusion detection

This section presents the two methods, namely the nearest neighbor and decision trees. We give a presentation of each of them and show the transformation that should be performed over the different attributes constituting the instances that represent the training and test examples used for intrusion detection. While we limit our discussion to decision trees and the nearest neighbor, other machine learning approaches may also be applied and combined with the reduction technique, based on principal component analysis, that we call eigenconnections.

4.3.1

Nearest neighbor classification

One of the easiest methods in the machine learning field is the nearest neighbor method, or NN. It consists in classifying new observations into their appropriate categories by a simple comparison with known, well classified observations. Recall that the only knowledge we have is a set of points x_i, i = 1, ..., M, correctly classified into categories. It is reasonable to assume that observations which are close together, for some appropriate metric, will share the same classification. Thus, when classifying an unknown sample x, it seems appropriate to weight the evidence of the nearby points heavily. One simple non-parametric decision procedure of this form is the nearest neighbor rule, or NN-rule. This rule classifies x in the category of its nearest neighbor. More precisely, we call x_0 a nearest neighbor to x if d(x, x_0) = min_{i=1,...,M} d(x, x_i), where d is the distance between the two considered points, such as the Euclidean distance. After its first introduction by Fix and Hodges [Fix & Hodges, 1951], the NN classifier has been used and improved by many researchers [Bay, 1998, Dasarathy, 1991] and employed on many data sets from the UCI repository [Hettich & Bay, 1999]. A common extension is to choose the most common class among the k nearest neighbors (kNN). The kNN was applied to the KDD 99 intrusion detection data sets by Eskin et al. [Eskin et al., 2003], for another purpose: the data set was filtered and the percentage of attacks reduced to 1.5% in order to perform unsupervised anomaly detection. In the following, we are interested in applying the NN classifier on the different data sets in its simplest form; that is, we compute all pairwise distances between the training data set and the test data set records.
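As a minimal sketch (illustrative only), the NN-rule can be written in a few lines of numpy once all records are encoded as numeric vectors:

```python
import numpy as np

def nn_classify(x, train_X, train_y):
    """NN-rule: assign x the class of its nearest neighbor x0, i.e. the
    training point minimizing the Euclidean distance d(x, xi)."""
    distances = np.linalg.norm(train_X - x, axis=1)  # d(x, xi), i = 1..M
    return train_y[int(np.argmin(distances))]        # class of x0
```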

78

Eigenconnections and supervised techniques to intrusion detection

Since our data sets consist of continuous and discrete attribute values, we have converted the discrete attribute values to continuous values according to the following idea. Consider a discrete attribute i with a set Σ_i of possible values. To each discrete attribute correspond |Σ_i| coordinates, one coordinate for every possible value of the attribute. The coordinate corresponding to the actual attribute value takes the value 1 and all the other coordinates of the considered attribute take the value 0. As an example, consider the protocol type attribute, which can take one of the discrete values tcp, udp or icmp. There will be three coordinates for this attribute: if the connection record has tcp (resp. udp or icmp) as its protocol type, then the corresponding coordinates will be (1, 0, 0) (resp. (0, 1, 0) or (0, 0, 1)). With this transformation, each connection record in the different KDD 99 data sets is represented by 125 coordinates instead of 41 (3 coordinates for the protocol type, 11 for the flag attribute, 67 for the service attribute and 2 for each of the 6 remaining binary 0/1 discrete attributes, the 32 continuous attributes keeping one coordinate each: 81 + 12 + 32 = 125).
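A minimal sketch of this transformation for one discrete attribute (names illustrative):

```python
def one_hot(value, domain):
    """Map a discrete attribute value to |Sigma_i| 0/1 coordinates,
    one coordinate per possible value of the attribute."""
    return [1.0 if value == v else 0.0 for v in domain]

# Example with the protocol type attribute (Sigma = {tcp, udp, icmp}):
print(one_hot("tcp", ["tcp", "udp", "icmp"]))   # [1.0, 0.0, 0.0]
print(one_hot("icmp", ["tcp", "udp", "icmp"]))  # [0.0, 0.0, 1.0]
```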

4.3.2

Classification by decision trees induction

Decision tree learners trace their origins back to the work of Hunt and others in the late 1950s [Hunt, 1962]. At least two seminal works are to be mentioned, those by Quinlan [Quinlan, 1986] and by Breiman et al. [Breiman et al., 1984]. The former synthesizes the experience gained by people working in the area of machine learning and describes a computer program called ID3, which has since evolved into a new system named C4.5 [Quinlan, 1993]. The latter originated in the field of statistical pattern recognition and describes a system named CART (Classification And Regression Trees), which has mainly been applied to medical diagnosis. A decision tree is a tree with three main components: nodes, arcs, and leaves. Each node is labeled with the feature attribute which is most informative among the attributes not yet considered in the path from the root, each arc out of a node is labeled with a value of the node's feature, and each leaf is labeled with a category or class. Decision tree classifiers are based on the "divide and conquer" strategy to construct an appropriate tree from a given learning set S containing a finite, nonempty set of labeled instances. The decision tree is constructed during the learning phase; it is then used to predict the classes of new instances. Most decision tree algorithms use a top down strategy, i.e. from the root to the leaves. Two main processes are necessary to use the decision tree:

• Building process: it consists in building the tree using the labeled training data set. An attribute is selected for each node based on how informative it is compared to the others. Leaves are also assigned their corresponding class during this process. To measure how informative a node is, the Shannon entropy is used to construct the decision tree. The selection of the best attribute node is based on the gain ratio GainRatio(S, A), where S is a set of records and A a non categorical attribute. The gain defines the expected reduction in entropy due to sorting on A. It is calculated as follows [Mitchell, 1997]:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)    (4.1)


In general, if we are given a probability distribution P = (p_1, p_2, ..., p_n), then the information conveyed by this distribution, which is called the entropy of P, is:

Entropy(P) = - \sum_{i=1}^{n} p_i \log_2 p_i    (4.2)

If we consider only Gain(S, A) then an attribute with many values will be automatically selected. One solution is to use GainRatio instead [Quinlan, 1986]:

GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}    (4.3)

where

SplitInformation(S, A) = - \sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (4.4)

where S_i is the subset of S for which A has the value v_i. This partitioning strategy is used to build the tree, the main goal being to divide the considered training examples by recursively selecting the best non categorical attribute. In the case of a discrete valued attribute, this strategy tests all possible values of the attribute under consideration. In the case of continuous-valued attributes, however, a transformation technique is introduced [Quinlan, 1986]. It consists in defining new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals. The algorithm dynamically creates a new boolean attribute A_t that is true if A < t and false otherwise. The selection of the threshold value t is based on the information gain (see Equation (4.1)): a threshold t is selected if it produces the greatest information gain. The different items are sorted according to the continuous attribute A, then a set of candidate thresholds midway between the successive values of A is generated. Fayyad [Fayyad, 1991] showed that the value of t that maximizes the information gain always lies at such a boundary. These candidate thresholds are evaluated by computing the information gain associated with each of them. The dynamically created boolean attributes can then compete with the other discrete valued candidate attributes available for growing the tree. In the following, we use this partitioning technique for evaluating the attributes with continuous values.

• Classification process: a decision tree is important not because it summarizes what we know, i.e. the training set, but because we hope it will correctly classify new cases. Thus, when building classification models one should have both training data to build the model and test data to verify how well it actually works. New instances are classified by traversing the tree from top to bottom, following the arcs matching their attribute values, until a leaf is reached that gives the class of the new instance.

Besides the construction and classification steps, many decision tree algorithms use another, optional step. It consists in removing some edges that are considered useless for the performance of the tree in the classification step. Pruning simplifies the tree, since many useless edges are removed, rendering complex trees more comprehensible for interpretation. In addition, an already built tree is pruned only when this gives better classification results than before pruning [Mitchell, 1997].
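The entropy and gain ratio computations of Equations (4.1) to (4.4) are short enough to sketch directly. The following minimal fragment assumes records given as dictionaries of discrete attribute values; it is an illustrative representation, not the C4.5 implementation used in our experiments.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(P) = -sum_i p_i log2 p_i over the class distribution (Eq. 4.2)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(records, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) Entropy(S_v) (Eq. 4.1)."""
    total, g = len(labels), entropy(labels)
    for v in set(r[attribute] for r in records):
        sv = [c for r, c in zip(records, labels) if r[attribute] == v]
        g -= (len(sv) / total) * entropy(sv)
    return g

def gain_ratio(records, labels, attribute):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A) (Eqs. 4.3, 4.4).
    SplitInformation is the entropy of the partition induced by A."""
    split = entropy([r[attribute] for r in records])
    return gain(records, labels, attribute) / split if split else 0.0
```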


4.4

Eigenconnection approach for intrusion detection

Chapter 3 described the application we introduced using principal component analysis to differentiate between users. Here, we use PCA as a reduction technique before applying a classification model. We recall the definition of this technique for space reduction and derive a feasible formulation for its application over network traffic. Besides transforming correlated variables into a smaller number of uncorrelated variables, principal component analysis has another objective, which consists in reducing the dimensionality (number of variables) of a data set while keeping most of the original variability in the data. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. In this section, we investigate the eigenconnection approach based on principal component analysis. In our case, each connection record corresponds to one vector of n variables corresponding to the attributes in the different data sets. The procedure is the following. The n different measures are collected in a vector, called the connection vector, representing the corresponding connection. So, if Γ is a connection vector, then we can write:

\Gamma = [m_1, m_2, \ldots, m_n]^T    (4.5)

where m_i, i = 1, ..., n, correspond to the different measures. In most cases, the connection vectors are very similar and they can be described by the most significant connection vectors. This approach involves the following initialization procedure:

1. acquire an initial set of connection records (this set is called the training set); for example, if we consider the KDD 99 10% training data set, then the number of training examples is about M = 494,021 connection records;

2. calculate the eigenconnections from the training set, keeping only the n′ (n′ ≪ n) eigenconnections that correspond to the highest eigenvalues. These n′ connections define the connection space. The eigenconnections are the eigenvectors of the covariance matrix C_{n×n}, where n corresponds to the number of attributes of each connection record (see Equation (3.20)). n′ is chosen using the inertia value given in Equation (3.16);

3. calculate the corresponding distribution in the n′-dimensional weight space for each known connection record, by projecting its connection vector onto the connection space;

4. (optional) perform a machine learning algorithm on the new data sets in the new connection space; for the decision tree algorithm, it is necessary to build the tree which will be used in the detection process, whereas there is no need to perform the NN algorithm at this stage. This is the reason why this step is optional, depending on the machine learning algorithm being used.

Having projected the training data onto the new feature space, the following steps are then used to classify and detect intrusions from the new connection records in the test data:


1. calculate a set of weights based on the input connection record and the n′ eigenconnections, by projecting the input connection record vector onto each eigenconnection;

2. use one of the different machine learning algorithms (classification process) to detect intrusions from the new connection records represented in the new feature space.
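A minimal numpy sketch of the initialization and projection steps follows; the choice of n′ from the inertia criterion of Equation (3.16) is omitted, and variable names are illustrative.

```python
import numpy as np

def eigenconnections(train, n_prime):
    """Compute the n' eigenconnections (eigenvectors of the covariance
    matrix with the n' highest eigenvalues) and project the training set."""
    mean = train.mean(axis=0)
    cov = np.cov(train, rowvar=False)                 # C (n x n), Eq. (3.20)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:n_prime]
    axes = eigvecs[:, top]                            # the n' eigenconnections
    return mean, axes, (train - mean) @ axes          # weights in n'-dim space

def project(connection, mean, axes):
    """Detection step 1: project a new connection record onto the
    connection space to obtain its weight vector."""
    return (connection - mean) @ axes
```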

4.5

Experimental data sets

Our different experiments, like those of many other researchers [Pfahringer, 2000, Levin, 2000, Eskin et al., 2003, Shyu et al., 2003, Fan et al., 2004] performed in recent years, are based on the KDD 99 [Hettich & Bay, 1999] data sets. These data sets are the result of a transformation of raw tcpdump traffic into connection records. The tcpdump traffic comes from the DARPA 1998 [DARPA 98 and DARPA 99, 2005] intrusion detection evaluation program. A framework for constructing features for intrusion detection systems is presented in [Lee & Stolfo, 2000]. Therefore, we assume in the following that the feature construction for intrusion detection (as a part of the data mining process in intrusion detection) is free of errors, and we conduct our experiments for building classifiers over the different KDD 99 data sets. The DARPA 98 transformation process and the different attributes that are considered for generating the connection records of the KDD 99 data sets are discussed in Section 4.5.1.

The main task of the KDD 99 classifier learning contest [KDD Task, 1999] was to provide a predictive model able to distinguish between legitimate (normal) and illegitimate (called intrusions or attacks) connections in a computer network. The full training data set contained about 5,000,000 connection records, and the 10% training data set consisted of 494,021 records, among which there were 97,278 normal connections (i.e. 19.69%). Each connection record consists of 41 different attributes that describe the different features of the corresponding connection, and each connection is labeled either as an attack, with one specific attack type, or as normal. The 39 different attack types present in the 10% data sets, and their corresponding occurrence numbers in the training and test data sets, are given in Table 4.1. Each attack type falls into exactly one of the following four categories:

• Probing: surveillance and other probing, e.g., port scanning;
• DoS: denial of service, e.g., syn flooding;
• U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;
• R2L: unauthorized access from a remote machine, e.g., password guessing.

The task was to predict the value of each connection (normal or one of the above attack categories) for each connection record of the test data set, containing 311,029 connections. It is important to note, from Table 4.1, that:

1. the test data set does not have the same probability distribution as the training data set;
2. the test data includes some specific attack types that are not present in the training data: only 22 of the 39 different attack types are present in the training data set.

Probing (4,107; 4,166): ipsweep (1,247; 306), mscan (0; 1,053), nmap (231; 84), portsweep (1,040; 364), saint (0; 736), satan (1,589; 1,633).

DoS (391,458; 229,853): apache2 (0; 794), back (2,203; 1,098), land (21; 9), mailbomb (0; 5,000), neptune (107,201; 58,001), pod (264; 87), processtable (0; 759), smurf (280,790; 164,091), teardrop (979; 12), udpstorm (0; 2).

U2R (52; 228): buffer_overflow (30; 22), httptunnel (0; 158), loadmodule (9; 2), perl (3; 2), ps (0; 16), rootkit (10; 13), sqlattack (0; 2), xterm (0; 13).

R2L (1,126; 16,189): ftp_write (8; 3), guess_passwd (53; 4,367), imap (12; 1), multihop (7; 18), named (0; 17), phf (4; 2), sendmail (0; 17), snmpgetattack (0; 7,741), snmpguess (0; 2,406), spy (2; 0), warezclient (1,020; 0), warezmaster (20; 1,602), worm (0; 2), xlock (0; 9), xsnoop (0; 4).

Table 4.1: The different attack types and their corresponding occurrence numbers, respectively, in the training and test data sets.

The remaining attacks are present in the test data set with different rates within their corresponding categories. There are 4 new U2R attack types in the test data set that are not present in the training data set. These new attacks are httptunnel, ps, sqlattack and xterm, and they correspond to 82.90% (189/228) of the U2R class in the test data set. On the other hand, there are 7 new R2L attack types. These attacks are named, sendmail, snmpgetattack, snmpguess, worm, xlock and xsnoop. The occurrence number of these new R2L attacks is 63% (10,196/16,189) of the R2L class, where 7,741 are of type snmpgetattack and 2,406 of type snmpguess. In addition, there are only 104 (out of 1,126) connection records in the training data set corresponding to the known R2L attacks present simultaneously in the two data sets. The warezmaster attack is one such case: there are only 20 connections of this type in the learning database and 1,602 in the test data set. There are also 4 new DoS attack types in the test data set, corresponding to 2.85% (6,555/229,853) of the DoS class in the test data set, and 2 new Probing attacks, corresponding to 42.94% (1,789/4,166) of the Probing class in the test data set. There are many occurrences of new attack forms for the two classes U2R and R2L in the test data set. The Probing class also presents many occurrences of new attack forms in the test data set; however, for this class the difference lies in the name of the scanning tool used, not in the method with which the probing is performed. We recall that the main goal of the KDD 99 contest was to predict the class of the different connections in the test data set as a normal connection or as one specific attack


among the different known attack classes. Unfortunately, this classification objective does not match the one sought by anomaly intrusion detection, since new attacks are normally different from known attacks and thus should not be classified as one of the a priori known classes (i.e. normal or one of the different categories mentioned in the training data set). We should mention that the different attacks present in the test data set that have no occurrence in the training data set cannot easily be classified into their appropriate class, and will be classified into the class whose form is closest to theirs. Moreover, if the connection record does not characterize the corresponding attack or the normal traffic as precisely as its initial tcpdump form, then the classification of the new attacks becomes unforeseeable. The process and the different conditions that should be met by a pre-processing engine for transforming raw data sets into well formed connection records are discussed in Chapter 5. The different experiments presented in Section 4.6, however, use the KDD 99 data sets, where we suppose that these pre-processed data sets characterize exactly the raw data sets they are issued from.

In order to construct valuable behavior models, many features should be gathered to characterize the considered behavior. If we need to know how a certain host in a network is behaving, then the captured traffic is analyzed; if a web site is considered, the http log is investigated, and so on. However, the raw "unstructured" data collected from a network or other sources is not easy to analyze with the different classification techniques (such as decision trees, clustering methods, etc.), which need a more structured data format to work well. A data preprocessing phase over the gathered raw data must be performed to extract meaningful features and measures.

4.5.1

DARPA 98 data pre-processing

The KDD 99 intrusion detection database is the result of a transformation into connection records [Lee, 1999] of tcpdump traffic collected in a local area network during nine weeks, simulating a typical U.S. Air Force LAN (Local Area Network). The MIT Lincoln Laboratories operated this simulated LAN as if it were a true Air Force environment, but peppered it with the 39 different attack types presented in Table 4.1. The 41 attributes correspond to the different features that characterize a network connection. These attributes fall into 4 different feature classes.

• The first class corresponds to the intrinsic features of network connection records. Table 4.2 presents the different features that are considered in this first class. The tool that was used for summarizing connections from network packets is Bro [Paxson, 1999], which filters and reassembles the different packets composing a connection between two hosts into one connection record, as summarized in Table 4.2.

• The second class corresponds to content features that are extracted from the data of the different packets. This feature extraction depends on domain knowledge to indicate whether the data contents suggest suspicious behavior. The main goal of this class is to include an extensive set of indicators for further detection during the classification process. However, the different attributes are chosen somewhat arbitrarily in [Lee & Stolfo, 2000], since only some attributes are taken into account according to expert knowledge. Some of these features are: the number of failed logins, which may indicate that the remote application is perhaps using a dictionary attack to guess a password, and the number of accesses to sensitive access control files such as "/etc/shadow" or ".rhosts", etc. (see Table 4.3).

feature | description | value type
duration | duration of the connection in seconds | continuous
protocol type | protocol type of the connection, such as tcp, udp, etc. | discrete
service | service type of the connection, such as telnet, http, rsh, etc. | discrete
src bytes | number of data bytes transferred from source to destination during the whole connection | continuous
dst bytes | number of data bytes transferred from destination to source during the whole connection | continuous
flag | summary of the connection status, e.g., normal establishment and termination, connection attempt seen, etc. | discrete
land | "1" if the connection is from/to the same host and port, "0" otherwise | discrete
wrong fragment | number of "wrong" fragments due to bad fragmentation | continuous
urgent | number of urgent packets during a session | continuous

Table 4.2: Intrinsic features of the network connection records.

• The features of the third class are automatically constructed by the MADAM ID tool [Lee & Stolfo, 2000], which uses association rules as the link analysis step of the KDD process for intrusion detection. However, these features are only relevant for probing attacks (such as port scanning) and DoS attacks (such as syn flooding). This is the reason why W. Lee [Lee, 1999] constructed the second class of features manually; these are the features that will actually be used for predicting U2R and R2L attacks. The features of this third class are automatically constructed and are summarized into:

  – the "same host" features, which examine only the connections in the past 2 seconds that have the same destination host as the current connection;

  – the "same service" features, which examine only the connections in the past 2 seconds that have the same service as the current connection.

These features are called "time based" traffic features of connection records and are summarized in Table 4.4; a sketch of their computation is given after this list.

• Lee [Lee, 1999] noticed that some probing attacks use a larger time window, exceeding 2 seconds. As a result, these attacks do not produce valuable intrusion patterns with a time window of 2 seconds. He therefore also used a "connection" window of 100 connections and constructed the so called "host based traffic" features as the fourth class. These features are the same as those of Table 4.4 but are calculated over the last 100 connections preceding the current one. This is the reason why we do not report them here.
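As an illustration of how the "same host" time-based features of Table 4.4 can be derived, the sketch below recomputes a few of them over a 2-second window; the record fields and the use of the S0 flag to mark SYN errors are assumptions made for the example, not the exact MADAM ID logic.

```python
def same_host_features(history, current, window=2.0):
    """Recompute a few 'same host' time-based features of Table 4.4 for the
    current connection.  Records are dicts with 'time', 'dst_host', 'service'
    and 'flag' fields (illustrative names)."""
    recent = [c for c in history
              if current["time"] - window <= c["time"] <= current["time"]
              and c["dst_host"] == current["dst_host"]]
    count = len(recent)
    if count == 0:
        return {"count": 0, "serror_rate": 0.0, "same_srv_rate": 0.0}
    return {
        "count": count,
        # assuming flag "S0" marks a SYN error (attempt seen, no reply)
        "serror_rate": sum(c["flag"] == "S0" for c in recent) / count,
        "same_srv_rate": sum(c["service"] == current["service"]
                             for c in recent) / count,
    }
```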

feature | description | value type
hot | number of "hot indicators" proper to the Bro tool | continuous
num failed logins | number of failed login attempts during the current connection | continuous
logged in | 1 if logged in with success, 0 otherwise | discrete
num compromised | number of compromised situations (e.g. path not found) | continuous
root shell | 1 if a root shell was gained, 0 otherwise | discrete
su attempted | 1 if "su" (super user) was attempted, 0 otherwise | discrete
num file creations | number of file creation operations | continuous
num shells | number of shell prompts | continuous
num access files | number of write, delete and create operations on access control files | continuous
num outbound cmds | number of outbound commands during an ftp session | continuous
is hot login | 1 if the login belongs to the sensitive list, such as "root, adm, etc.", 0 otherwise | discrete
is guest login | 1 if the login is a "guest" account (e.g. guest, anonymous, etc.), 0 otherwise | discrete

Table 4.3: Domain knowledge content features of network connection records.

4.5.2

Discussions on the DARPA 98 and KDD 99 pre-processing

Most intrusion detection techniques, except those based on pattern matching, require data sets to perform experiments. When researchers began to work on advanced intrusion detection in the 1990s, they quickly recognized the need for common data sets to perform these experiments. Such data sets would allow the different techniques to be quantitatively and qualitatively compared. They would also bring an alternative solution to the problem of proprietary data, with which results are generally not reproducible, mainly because of privacy concerns. As a response to this problem, Lincoln Laboratory, with DARPA sponsorship, created data sets for intrusion detection called DARPA 98 that serve as an evaluation benchmark [DARPA 98 and DARPA 99, 2005]. The effort was repeated to create a second data set, DARPA 99 [DARPA 98 and DARPA 99, 2005], containing new attacks and in particular the distributed denial of service attacks that appeared in 1999. While the different DARPA intrusion detection data sets have been widely used by many researchers to conduct their experiments, McHugh [McHugh, 2000] severely criticized these data sets. He presented many interesting points on how an evaluation of IDSs should be performed and raised many criticisms of the DARPA challenge, without specifying how difficult it would be to address some of the issues. As an example, he pointed out that the generated data was not compared to real data to verify that it has the same rates of malicious traffic versus non-malicious anomalous traffic that may cause false positives. He also criticized the topology of the simulated network, which does not reflect a real Air Force LAN, and the uniform distribution of the four attack categories over the seven weeks of the experiment. McHugh [McHugh, 2000] criticized the DARPA data sets but did not give valuable so-

feature | description | value type
count | number of connections to the same host as the current one in the past 2 seconds | continuous
The following features refer to these connections to the same host during the last 2 seconds:
serror rate | percentage of the "count" connections that have "SYN" errors | continuous
rerror rate | percentage of the "count" connections that have "REJ" errors | continuous
same srv rate | percentage of the "count" connections to the same service | continuous
diff srv rate | percentage of the "count" connections to different services | continuous
srv count | number of connections to the same service as the current one in the past 2 seconds | continuous
The following features refer to these connections to the same service during the last 2 seconds:
srv serror rate | percentage of the "srv count" connections that have "SYN" errors | continuous
srv rerror rate | percentage of the "srv count" connections that have "REJ" errors | continuous
srv diff host rate | percentage of the "srv count" connections to different hosts | continuous

Table 4.4: Time based traffic features of network connection records.

lutions to the different issues he raised. The role of his criticism was to inform the community about the different shortcomings and omissions of the evaluation data sets, hoping they would be addressed in future efforts. Mahoney and Chan [Mahoney & Chan, 2003] investigated the simulation artifacts in the DARPA data set and proposed to solve the problem by mixing the corresponding traffic with real traffic. First, they compared real traffic they collected from a university department server with the simulated DARPA traffic. They found that many attributes of network traffic that appear to be predictable in the DARPA intrusion detection data sets are less predictable in real traffic. As an example, they pointed out that only 9 of the 256 possible TTL values are observed in the DARPA intrusion detection data sets, whereas they observed 177 different values in the real traffic they collected. The authors argued that one reason for mixing real traffic with the simulated traffic is the great expense at which this data was generated, which means it cannot easily be replaced. In addition, they believed that the attacks were, for the most part, simulated correctly. Therefore, they used in their experiments the real traffic as background and training data and continued to use the labeled simulated attacks as before. They mixed the simulated traffic with the real traffic and tested five anomaly detection systems on the mixed traffic. With the different analyses and tests they performed, they showed that many attacks which appear to be detected thanks to simulation artifacts are no longer detected when real traffic is added. While this approach is a good step towards a more realistic performance measure using the DARPA data sets, it is not appropriate for research for two reasons. First, it requires attack-free, or at least accurately labeled, real world data, which is not available for the community to share as a standard


for comparison. Second, the anomaly detection method that is used should not differentiate between the DARPA data and the real world data. These two points are very hard to achieve, since finding and removing all of the hostile traffic would be difficult, and evaluations using real traffic are still not repeatable because privacy and security concerns usually prohibit the release of this data for off-line use. The authors [Mahoney & Chan, 2003] agreed that their approach allows partial reproducibility based on a publicly available benchmark.

Lee [Lee, 1999] invested a great effort in the DARPA 98 data and defined the 41 attributes described in Section 4.5.1 for evaluation. The KDD 99 [KDD Cup, 1999] contest is based on Lee's transformation of the different tcpdump data into connection records composed of 41 different attributes. As described in Section 4.5.1, some attributes correspond to intrinsic features of the connections, others are automatically extracted from the data by using known attacks, namely probing and DoS attacks. However, the different attributes related to suspicious user behavior, concerning password guessing and buffer overflows, are defined using expert knowledge. Whether or not these attributes are sufficient to describe the malicious payloads of the different U2R and R2L attacks has never been discussed in the literature.

Although the DARPA intrusion detection data sets are criticized, much work has been invested in them. Much of the research over the DARPA data sets was performed over the KDD 99 data [Lee & Stolfo, 2000, Eskin et al., 2003, BenAmor et al., 2004, Fan et al., 2004] and currently continues over the same data sets. Some works used the whole KDD 99 training data sets and followed a misuse detection technique. Others used a portion of this data for anomaly detection, under the hypothesis that anomalies are far less frequent than normal data. We have also performed our experiments over the KDD 99 data sets, and the following sections give the results we obtained. In the following, we use the different KDD 99 data sets, considering the two data sets with the different attacks as presented in Table 4.1, and report the different results obtained using the reduction technique. The results obtained after the reduction phase are compared with those obtained without reduction.

4.6

Experimental methodology and results

We present the different results and experiments obtained when applying the methods discussed in Section 4.3 on the different KDD 99 cup data sets, either directly or in combination with principal component analysis. For the combination with principal component analysis, all data sets are projected onto the new space generated by a small number of PCA principal axes. The two supervised algorithms described above are then applied to these projected data in the new, reduced PCA space. The accuracy of each experiment is based on the percentage of successful prediction (PSP) on the test data set:

PSP = (number of successful instance classifications) / (number of instances in the test set)    (4.6)
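The PSP and the row-normalized confusion matrices reported below (Tables 4.5 to 4.8) can be computed as in the following minimal sketch:

```python
from collections import Counter

def psp(predicted, actual):
    """Percentage of successful prediction (Equation 4.6)."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def confusion_rows(predicted, actual, classes):
    """Row-normalized confusion matrix: for each actual class, the
    percentage of its instances predicted as each class."""
    rows = {}
    for cls in classes:
        preds = [p for p, a in zip(predicted, actual) if a == cls]
        counts = Counter(preds)
        rows[cls] = [100.0 * counts[c] / len(preds) if preds else 0.0
                     for c in classes]
    return rows
```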

4.6.1

Nearest neighbor with/without PCA

The first experiment we perform consists in evaluating the nearest neighbor algorithm on the KDD 99 database. The main problem encountered when applying the nearest neighbor is that computing the nearest neighbor of each point is computationally expensive.


The complexity of this computation is O(nm), where n is the number of connection records in the training data set and m corresponds to the number of connection records in the test data set. Each distance computation between two connection records depends on the number of space coordinates. Of course, this algorithm may be approximated with a cluster based estimation algorithm [Han & Kamber, 2001]. However, the distance between two connection records remains dependent on the number of coordinates of the feature space. For this reason, we have projected the different data set connections onto the new feature space generated by the principal component axes. There are 125 coordinates for each connection in KDD 99 after the transformation of the discrete attribute values explained in Section 4.3 (the 41 attributes, to which the representation of each discrete value adds at least two coordinates).

Nearest Neighbor without PCA

Table 4.5 presents the confusion matrix obtained when applying the nearest neighbor directly on the feature space generated by these 125 coordinates. The lines correspond to the actual class of the connections whilst the columns correspond to the predicted class. For example, the cell at the intersection of the Probing line and the Normal column, whose value is 17.21%, corresponds to the ratio of connection records whose actual class is Probing that are classified as normal traffic.

Predicted as | Normal | Probing | DoS | U2R | R2L
Normal (60593) | 99.50% | 0.26% | 0.24% | 0.00% | 0.00%
Probing (4166) | 17.21% | 72.01% | 10.28% | 0.00% | 0.50%
DoS (229853) | 2.87% | 0.12% | 97.01% | 0.00% | 0.00%
U2R (228) | 39.96% | 18.80% | 32.01% | 6.60% | 2.63%
R2L (16189) | 96.12% | 2.65% | 0.00% | 0.02% | 1.21%
PSP = 92.05%

Table 4.5: Confusion matrix obtained with the nearest neighbor algorithm on 125 coordinates.

Nearest Neighbor with PCA

Using the same algorithm, we have evaluated the test data set in a new feature space generated by at most seven PCA axes. We performed the different experiments considering 2, 3, ..., or 7 axes. The results do not differ much from each other between 2 and 7 axes. Table 4.6 shows the confusion matrix obtained when considering four axes (i.e. each connection record in the different data sets is represented by only four coordinates). The confusion matrix in Table 4.6 shows that the results after applying PCA are slightly better. In addition, the computation time is reduced by a factor of approximately thirty (~125/4) when considering 4 principal components. Hence, it is better to reduce the space in which the connection records are represented before applying any machine learning algorithm. This first experiment shows that the combination of PCA and the nearest neighbor performs well even if few axes (at most seven) are considered to represent the records. According to Equation (3.16), the inertia ratio is close to 1 (99.99%) when considering only 4 axes. This is the reason why a representation with only four axes provides a good prediction rate.

Predicted as | Normal | Probing | DoS | U2R | R2L
Normal (60593) | 99.50% | 0.27% | 0.23% | 0.00% | 0.00%
Probing (4166) | 13.87% | 74.40% | 11.37% | 0.00% | 0.36%
DoS (229853) | 2.68% | 0.18% | 97.14% | 0.00% | 0.00%
U2R (228) | 35.96% | 14.47% | 39.03% | 7.91% | 2.63%
R2L (16189) | 97.49% | 1.71% | 0.00% | 0.00% | 0.80%
PSP = 92.22%

Table 4.6: Confusion matrix obtained with the nearest neighbor on 4 coordinates after performing PCA.

In the two experiments, the two last classes, R2L and U2R, are not well detected. The maximum successful prediction is 7.91% for the U2R class and 1.21% for R2L. While the time consumed by the first experiment is approximately one week, it is reduced to only 5 hours when the reduction technique is applied to the feature space of the different measures.

4.6.2

Decision trees with/without PCA

This section presents experimental results using decision trees with the C4.5 algorithm [Quinlan, 1993]. In the first experiment, C4.5 is applied directly on the different data sets using all 41 attributes; it is then compared with its application on the same data sets after their projection onto the new space generated by a few principal component axes. During our experiments we considered, as in [BenAmor et al., 2004], two cases. The first consists in grouping the 39 attack types into the four attack categories before training. In the second case, they are grouped after classification, as sketched below.
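The two grouping strategies amount to applying a type-to-category mapping either to the training labels or to the predictions. A minimal sketch follows; only a few entries of the mapping are shown, the full mapping following Table 4.1.

```python
# Partial type-to-category mapping (the full mapping follows Table 4.1).
CATEGORY = {
    "normal": "Normal",
    "ipsweep": "Probing", "nmap": "Probing", "satan": "Probing",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS",
    "rootkit": "U2R", "perl": "U2R",
    "guess_passwd": "R2L", "imap": "R2L",
}

def group(labels):
    """Replace each specific attack type by its category."""
    return [CATEGORY[l] for l in labels]

# Case 1: group before training:      model.fit(X_train, group(y_train))
# Case 2: train on the 39 types, then group the predictions:
#         grouped_predictions = group(model.predict(X_test))
```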

Decision trees without PCA

In this section, we present the results obtained when applying the C4.5 algorithm directly on rough data. Table 4.7 evaluates the application of the C4.5 algorithm on the rough data set when the attacks are gathered into the four categories before the training step, and when they are gathered after classification. The values between parentheses correspond to gathering the attack results into the five categories after classification.

Predicted as | %Normal | %Probing | %DoS | %U2R | %R2L
Normal (60593) | 99.49 (99.42) | 0.36 (0.39) | 0.12 (0.15) | 0.00 (0.00) | 0.02 (0.03)
Probing (4166) | 21.32 (15.75) | 74.70 (78.80) | 3.98 (5.45) | 0.00 (0.00) | 0.00 (0.00)
DoS (229853) | 2.68 (2.58) | 0.00 (0.46) | 97.31 (96.96) | 0.00 (0.00) | 0.00 (0.00)
U2R (228) | 90.79 (56.58) | 1.75 (28.51) | 0.44 (0.88) | 4.39 (5.26) | 2.63 (8.77)
R2L (16189) | 92.03 (94.63) | 2.10 (0.07) | 0.01 (0.00) | 0.02 (0.03) | 5.84 (5.27)
PSP = 92.60% (92.35%)

Table 4.7: Confusion matrix relative to five classes using the C4.5 algorithm.

The results obtained in this first experiment show that gathering the attacks before training or after classification does not influence the percentage of successful prediction. The two classes U2R and R2L are classified with a percentage of successful prediction of at most 5.84%. We can consider two different hypotheses for the low prediction ratio of these two classes. The first is the low number of samples of these two classes in the training set: 0.01% of U2R examples in the training set (resp. 0.23% of R2L) versus 0.07% of U2R (resp. 5.20% of R2L) in the test data set. The second is the new forms of these two attack classes that appear in the test data set and are not present in the training set. In the following chapter we investigate the different data sets in depth and improve decision trees in order to improve the detection ratio of the classes that present new forms in the test data set.

Decision trees with PCA

We now apply C4.5 on the new feature space generated by the principal axes. All the training data set and test data set connection records are projected onto the new feature space. In our different experiments, this new feature space is generated by at most 7 axes, to validate the results of combining PCA with the C4.5 decision tree algorithm.

Predicted as | %Normal | %Probing | %DoS | %U2R | %R2L
Normal (60593) | 99.00 (98.99) | 0.85 (0.84) | 0.12 (0.12) | 0.00 (0.00) | 0.03 (0.04)
Probing (4166) | 29.60 (30.20) | 66.80 (66.30) | 3.50 (3.50) | 0.10 (0.00) | 0.00 (0.00)
DoS (229853) | 2.42 (2.42) | 0.33 (0.33) | 97.25 (97.25) | 0.00 (0.00) | 0.00 (0.00)
U2R (228) | 92.98 (91.23) | 0.00 (0.00) | 0.44 (0.00) | 6.58 (8.33) | 0.00 (0.44)
R2L (16189) | 99.94 (97.69) | 0.00 (0.00) | 0.06 (0.01) | 0.00 (0.00) | 0.01 (2.30)
PSP = 92.05% (92.16%)

Table 4.8: Confusion matrix relative to five classes using the C4.5 algorithm after projecting the data sets onto two principal component axes.

According to Tables 4.8 and 4.11, there is a slight difference between the use of decision trees on rough data and their combination with PCA in the new feature space. However, it is important to mention that the number of nodes in the decision tree generated when C4.5 is applied on rough data is greater than the number of nodes in the decision tree built from the same data sets in the new feature space generated by PCA. In addition, the training time consumed to construct the decision tree from the new data in the feature space, generated by at most seven principal components, is much lower, as presented in Table 4.9.

 | Decision trees without PCA | Decision trees with PCA
Number of nodes | ~1500 before pruning, ~700 after pruning | ~330 before pruning, ~211 after pruning
Training time | ~3 mn 40 sec | ~50 sec

Table 4.9: Time and tree size with/without PCA.

Furthermore, the problem of the low prediction ratio for the last two classes persists, as mentioned in the previous subsection. The highest successful prediction ratio obtained for the R2L class does not exceed 2.30%. This is because the principal components associated with the smallest eigenvalues, often regarded as carrying no interesting information [Jolliffe, 2002], correspond in our case to the classes that are not present with high rates in the training set; yet we retain only the components with the highest eigenvalues. This is the reason why the R2L class prediction ratio is very small. To circumvent this problem, we have also considered axes corresponding to lower-ranked principal components. In this case, we obtained 5.86% as the highest prediction ratio for the R2L class.

Attack Category | Normal (60593) | Probing (4166) | DoS (229853) | U2R (228) | R2L (16189)
Detection Ratio | 99.52% | 78.84% | 98.26% | 12.72% | 5.86%
PSP = 92.63%

Table 4.10: The best prediction ratios obtained for each class when considering different principal components.

According to Table 4.10, the results obtained by combining decision trees with PCA are slightly better than those of Table 4.7, obtained by applying decision trees directly on rough data.

4.7

Summary

In this chapter we described the knowledge discovery process for intrusion detection. It mainly consists in extracting meaningful features from raw data sets, such as network traffic, that may be used for distinguishing between normal and intrusive data. The classification of new data sets is performed using supervised classification techniques. We introduced a new idea on how to reduce the representation spaces before applying machine learning algorithms on connection records for misuse intrusion detection. Since intrusion detection data sets containing features extracted from network traffic were made widely available by the 1998 DARPA intrusion detection evaluation program, we neither focus our research on feature extraction nor examine the exactitude of the transformation performed over these data sets by the MADAM ID tool [Lee, 1999]; like many other researchers, we use the different KDD 99 data sets directly. The reduction technique we introduced as a step in building a classification model for intrusion detection significantly reduces the learning and detection time, with similar successful prediction, in all our experiments. However, the two classes R2L and U2R remain detected at a low rate, whether or not the supervised technique is combined with PCA. Moreover, the main drawback which persists in


combining decision trees or the nearest neighbor with PCA is the poor prediction rate of the R2L class, which is most of the time classified as normal, at a rate of more than 99% in almost all cases. We notice that many new attacks are present in the test data set but not in the training data set. The classification of a new attack, one that is different from the attacks present in the training data set, should be improved so that it is detected as new rather than forced into one of the four attack classes or the normal class. Current supervised techniques cannot detect these new attacks since they only classify new instances into one of the known classes seen during the training stage. One solution to this limitation consists in improving these supervised techniques to detect and classify new attacks. In the following chapter, we improve some supervised techniques for detecting known and new attacks and we show the effectiveness of these improved techniques.

Predicted as | %Normal | %Probing | %DoS | %U2R | %R2L
Normal (60593)
PCA3 | 99.47 (99.52) | 0.35 (0.32) | 0.13 (0.12) | 0.00 (0.00) | 0.05 (0.04)
PCA4 | 99.43 (99.51) | 0.36 (0.34) | 0.13 (0.12) | 0.01 (0.01) | 0.07 (0.02)
PCA5 | 99.23 (99.19) | 0.39 (0.37) | 0.22 (0.27) | 0.00 (0.16) | 0.15 (0.02)
PCA6 | 99.46 (99.47) | 0.36 (0.34) | 0.16 (0.15) | 0.00 (0.02) | 0.02 (0.02)
PCA7 | 99.39 (99.42) | 0.35 (0.14) | 0.24 (0.02) | 0.00 (0.06) | 0.02 (0.36)
Probing (4166)
PCA3 | 28.88 (30.63) | 67.67 (62.12) | 3.36 (7.25) | 0.10 (0.00) | 0.00 (0.00)
PCA4 | 24.20 (30.56) | 69.40 (63.08) | 6.39 (6.34) | 0.00 (0.00) | 0.02 (0.02)
PCA5 | 13.85 (16.30) | 76.26 (73.09) | 8.93 (10.30) | 0.00 (0.29) | 0.96 (0.02)
PCA6 | 18.31 (22.56) | 76.48 (69.40) | 5.21 (8.02) | 0.00 (0.02) | 0.00 (0.00)
PCA7 | 14.98 (22.71) | 75.76 (69.28) | 9.27 (8.02) | 0.00 (0.00) | 0.00 (0.00)
DoS (229853)
PCA3 | 2.60 (2.75) | 0.20 (0.00) | 97.18 (97.25) | 0.02 (0.00) | 0.00 (0.00)
PCA4 | 2.53 (2.76) | 0.27 (0.05) | 97.17 (97.17) | 0.02 (0.02) | 0.00 (0.00)
PCA5 | 2.70 (2.69) | 0.02 (0.06) | 97.26 (97.16) | 0.02 (0.09) | 0.00 (0.00)
PCA6 | 2.76 (2.77) | 0.02 (0.03) | 97.20 (97.06) | 0.02 (0.14) | 0.00 (0.00)
PCA7 | 2.71 (2.75) | 0.04 (0.03) | 97.22 (97.06) | 0.02 (0.16) | 0.00 (0.00)
U2R (228)
PCA3 | 85.09 (65.35) | 1.75 (9.65) | 0.44 (14.91) | 12.72 (8.33) | 0.00 (1.75)
PCA4 | 68.86 (77.63) | 9.65 (5.70) | 9.65 (9.65) | 5.26 (6.58) | 6.58 (0.44)
PCA5 | 35.53 (62.28) | 47.37 (19.74) | 11.40 (13.60) | 4.39 (3.95) | 1.32 (0.44)
PCA6 | 38.16 (41.23) | 47.37 (3.51) | 7.46 (48.25) | 5.70 (4.39) | 1.32 (2.63)
PCA7 | 38.16 (41.23) | 47.37 (3.51) | 7.46 (48.25) | 5.70 (4.39) | 1.32 (2.63)
R2L (16189)
PCA3 | 99.60 (99.60) | 0.02 (0.02) | 0.02 (0.02) | 0.01 (0.01) | 0.35 (0.35)
PCA4 | 99.83 (99.81) | 0.03 (0.01) | 0.03 (0.03) | 0.01 (0.02) | 0.11 (0.12)
PCA5 | 99.81 (99.38) | 0.02 (0.14) | 0.04 (0.04) | 0.03 (0.04) | 0.10 (0.41)
PCA6 | 99.82 (99.43) | 0.02 (0.14) | 0.03 (0.02) | 0.04 (0.05) | 0.09 (0.36)
PCA7 | 99.81 (99.42) | 0.02 (0.14) | 0.02 (0.02) | 0.06 (0.06) | 0.09 (0.36)
PSP3 = 92.12% (92.11%), PSP4 = 92.12% (92.05%), PSP5 = 92.23% (92.13%), PSP6 = 92.24% (92.06%), PSP7 = 92.23% (92.05%)

Table 4.11: Confusion matrix relative to five classes using the C4.5 algorithm after data set projection onto 3, 4, 5, 6 or 7 principal component axes. PCAi corresponds to the projection of the data onto the first i principal axes, those associated with the i highest eigenvalues.

Chapter 5

Novel attacks detection

It is well known that signature based intrusion detection systems are only able to detect known attacks. Unfortunately, current anomaly based intrusion detection systems are also incapable of detecting all kinds of new attacks, since they are designed for restricted applications in limited environments. As an example, the immune technique discussed in Section 2.2.3 applies only to system calls of some known server daemon applications present in UNIX or Linux environments. In this chapter, we briefly report some related work in anomaly intrusion detection for network traffic, particularly for the KDD 99 and DARPA 98/99 data sets. We give the reasons why the two attack types, namely U2R and R2L, are not detected with high prediction rates by the different classification techniques presented in Chapter 4. We then enhance two supervised techniques for network traffic anomaly detection to discover known and unknown attacks. Experimental results demonstrate that the proposed methods are highly successful in detecting new attacks and significantly outperform previous work. Some results of this chapter are reported in [Bouzida & Cuppens, 2006] and show the necessity of detecting new network intrusions.

5.1

Related Work

In Chapter 4, we investigated the data mining procedure as a process for constructing models and detecting known intrusions. The detection process is based on classification techniques, particularly on supervised machine learning techniques for misuse detection. There are two kinds of learning techniques: supervised and unsupervised. Supervised techniques use a learning database whose records are labeled with their appropriate category. The goal of a supervised method is to learn how to produce the correct category for a new input record. Unlike supervised learning, unsupervised learning does not rely on predefined classes and class labeled training examples. Much of the work done over KDD 99 for intrusion detection failed to detect new attacks when the whole data set is considered. However, some researchers tried to detect new anomalies by filtering the different data sets and considering only small numbers of attacks in their experiments. They used for this purpose unsupervised techniques, where anomalies correspond to outliers after clustering. In the following, we discuss different techniques used for network anomaly detection over the DARPA 98 data sets. The first is an adaptive Bayesian model called EBayes TCP. The second describes unsupervised techniques performed over filtered DARPA 98


data sets. Finally, the last technique generates artificial anomalies for network anomaly detection.

5.1.1

EBayes TCP—Adaptive Bayesian model

In [Valdes & Skinner, 2000], Valdes and Skinner presented an adaptive, model based technique for attack detection using Bayes inference network technology [Pearl, 1988] to analyze bursts of traffic. This model is integrated as a component of the broader EMERALD system [Porras & Neumann, 1997] and may process either live TCP traffic or tcpdump data in batch mode by relying on the EMERALD ETCPGEN and EMONTCP components [Valdes & Skinner, 1998]. EBayes TCP consists of two components: a TCP-specific module that interfaces to the appropriate EMERALD components, and a Bayesian inference class library. In mathematical terms, their framework adapts Pearl's belief propagation in causal trees [Pearl, 1988]. Knowledge is represented as nodes in a tree, where each node can be in one of several discrete states. A node receives prior (or causal support) messages from its parent and likelihood (or diagnostic support) messages from its children as events are observed. Priors propagate downward through the tree and likelihoods propagate upward; both correspond to discrete distributions (i.e. they are positive valued and sum to unity). A prior message incorporates all the information not observed at the node, and the likelihood at terminal or leaf nodes corresponds to the directly observable evidence (i.e. observable variables, or observables). A conditional probability table, denoted CPT, which links a child to its parent, is maintained and adjusted according to the adaptive process discussed below. Figure 5.1 illustrates a simple Bayesian tree with its corresponding conditional probability table. It encodes the a priori belief that the probabilities of many open connections (0.7) and of many different IP addresses (0.8) are high, whereas the probability of many different unique ports (0.3) is low, when an IP sweep is observed. This is due to the fact that an attacker scans many IP addresses of a monitored subnetwork while only few ports are scanned on each unique host. When the three observables are determined during system runtime, the probability of the hypothesis class (i.e. IP sweep) may then be computed using the Bayesian inference mechanism [Pearl, 1988].
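For a single-level tree like that of Figure 5.1, this inference reduces to Bayes' rule with the CPT entries as likelihoods. The following fragment is a minimal sketch of that computation, with an invented "normal" alternative hypothesis and illustrative numbers; it is not the EMERALD implementation.

```python
def posterior(prior, cpt, observed):
    """P(h | observations) for a one-level tree: prior times the product of
    the CPT likelihoods of each observable, then normalized.
    prior: {hypothesis: P(h)}, cpt: {hypothesis: {obs: P(obs=true | h)}},
    observed: {obs: bool}."""
    scores = {}
    for h, p in prior.items():
        for obs, value in observed.items():
            p_true = cpt[h][obs]
            p *= p_true if value else (1.0 - p_true)
        scores[h] = p
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}

# Illustrative call with the Figure 5.1 CPT and an invented 'normal' row:
print(posterior(
    {"ipsweep": 0.1, "normal": 0.9},
    {"ipsweep": {"max_open": 0.7, "uniq_ip": 0.8, "uniq_ports": 0.2},
     "normal":  {"max_open": 0.2, "uniq_ip": 0.1, "uniq_ports": 0.1}},
    {"max_open": True, "uniq_ip": True, "uniq_ports": False}))
```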
Moreover, for this kind of attack a timeout is set.

The Bayesian inference network used in EBayes TCP consists of only one root, which corresponds to the (unobservable) session class, and of several observed and derived variables (i.e. observables) from the tcpdump data as children.

[Figure 5.1: Bayesian tree example for IP Sweep. The attack-class hypothesis at the root sends causal support down to the observables, which return diagnostic support. Its conditional probability table is:

Observable               P(true | IP Sweep)   P(false | IP Sweep)
Max Open Conn.                  0.7                  0.3
Nb. Unique IP Addr.             0.8                  0.2
Nb. Unique Ports                0.3                  0.7  ]

Figure 5.2 presents the structure of the Bayesian network considered for the different experiments; this structure is assumed to be updated each time a burst of traffic occurs or the timeout elapses.

[Figure 5.2: EBayes TCP belief tree structure. The Session Class root has the following children: Event Intensity, Error Intensity, Connect Code Distribution, Service Distribution, Max Open to Any Host, Number of Unique Ports, and Number of Unique IP Addresses.]

The different nodes of this structure are:

• Session class: the unobservable variable, located at the root node, whose value should be determined. The hypothesis classes taken into consideration in the experiments are: mail, http, ftp, telnet/remote usage, other normal, http_f, dictionary, processtable, mailbomb, portsweep, ipsweep, synflood, and other attack. The first five hypotheses correspond to normal classes, whereas the others are attack classes.

• Event and error intensity: to each event observed during a session corresponds an intensity measure. The range of these measures is categorized to obtain the observed state values of the respective nodes.

• Max Open to Any host: these variables are known as high water mark measures. The system maintains a list of the ports and hosts accessed by the current session; using this list, the maximum number of open connections to any unique host is derived. The total number of different ports and hosts accessed by a session is also recorded. The range of these measures is categorized to obtain the respective node states.

• Service distribution: this variable monitors the distribution of the categories over an underlying discrete valued measure; events are classified according to their service category (for example mail, http, ftp, or other).

• Number of unique ports: this variable counts the number of different ports seen during a connection.

• Number of unique IP addresses: this variable counts the number of connections to different IP addresses.

• Connect code distribution: this variable monitors the distribution of the connection codes: half-open connection, correctly terminated connection, rejected connection attempt, etc. The corresponding code is identified using the different flags seen over the packets of the session under evaluation. As for the event intensity and Max Open to Any Host variables, the range of the measures of this variable is categorized to obtain the respective node states.

One should note that EBayes TCP is coupled with a service availability monitor that learns the valid hosts and services of the protected network via a process of unsupervised discovery; i.e. the topology of the network is used for some of the above observable measures. However, the authors did not explain how this unsupervised technique discovers the different services, nor whether it is possible for an attacker to mislead this component. For example, an attack may consist in sending packets faking the presence of some services and hosts; this situation is possible when a machine in the protected network is compromised. We assume in the following that their system is safe and that the unsupervised discovery finds only real hosts and services.

EBayes TCP has the adaptive ability to evolve on-line by adjusting the conditional probability table (CPT). In addition, the inference model can create new hypotheses corresponding to new attacks that are not considered a priori. The authors experimented with two cases. In the first case, they considered all the hypothesis classes described above and noticed that the system does not add more hypotheses; they conclude that no more hypotheses are needed to classify these data. In the second case, they initialized the system with a single valid hypothesis at the root. The corresponding model was first tested against a week of normal data, and EBayes TCP generated two


valid states. The other experiments consisted in testing it against data known to contain attacks; the system added two new states, which are then considered as two new attacks. This model may therefore detect some new attacks that were not considered a priori. However, the hypotheses taken here are restrictive, since EBayes TCP is only effective against bursts of traffic and non-stealthy probe attacks. The authors [Porras & Neumann, 1997] mention that it is moderately effective against stealthy attacks, since it detected one stealthy probe out of three in the DARPA 99 data sets during the different experiments performed with this inference model. In Section 4.5.1, the stealthy attacks corresponding to probes that last more than one minute are taken into consideration in KDD 99. The hypothesis taken in MADAM/ID [Lee et al., 1999] for constructing the fourth class of attributes solves the problem of stealthy probing attacks: MADAM/ID [Lee, 1999] considers the features of the last 100 connections preceding the current one in order to construct this fourth class. Indeed, in our experiments over DARPA 98 (KDD 99), the detection rate of both stealthy and non-stealthy probes is very high, as reported in Section 4.6.

EBayes TCP uses an enhanced statistical technique to classify network traffic. However, the payload is never inspected. In addition, EBayes TCP can detect only attacks that generate bursts of traffic, and it considers only TCP traffic. Therefore, attacks that exploit a buffer overflow, or any other vulnerability that does not generate a burst of traffic, will never be detected. We should introduce new techniques, or enhance existing ones, that can be used in any situation. Sections 5.3 and 5.4 discuss one solution for detecting new attacks without relying on hypotheses that are not always realistic in real life.
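To make the inference step concrete, the following is a minimal sketch of a posterior computation for the IP Sweep hypothesis of Figure 5.1, assuming a naive Bayes factorization over binary observables; the prior and the likelihoods under normal traffic are illustrative assumptions, not values from [Valdes & Skinner, 2000].

# Sketch of a Bayes update for the IP Sweep hypothesis of Figure 5.1.
# CPT values follow the figure; prior and p_true_normal are assumed.
CPT_IPSWEEP = {                      # P(observable = true | IP Sweep)
    "max_open_conn": 0.7,
    "unique_ip_addresses": 0.8,
    "unique_ports": 0.3,
}

def ipsweep_posterior(observations, prior=0.01, p_true_normal=0.1):
    """observations maps each observable name to True/False; returns
    P(IP Sweep | observations) against a flat 'normal' alternative."""
    l_attack, l_normal = prior, 1.0 - prior
    for name, seen in observations.items():
        p = CPT_IPSWEEP[name]
        l_attack *= p if seen else 1.0 - p
        l_normal *= p_true_normal if seen else 1.0 - p_true_normal
    return l_attack / (l_attack + l_normal)

print(ipsweep_posterior({"max_open_conn": True,
                         "unique_ip_addresses": True,
                         "unique_ports": False}))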

5.1.2 Unsupervised anomaly detection

Many experiments have been performed on unsupervised learning for anomaly detection. Emran and Ye [Emran & Ye, 2001] use audit events containing both normal and intrusive activities. They convert the input audit event stream into an observation stream using a most-recent-past Exponentially Weighted Moving Average (EWMA) technique based on the type of event, and capture the long term profile of the observation stream, which represents expected normal activities. Only normal events are used during training. A multivariate statistical anomaly detection technique called the Canberra technique [Emran & Ye, 2001] was developed and applied to intrusion detection over audit events. The method does not suffer from the normality assumption on the data; however, the authors showed that their technique performs well only in some cases, particularly when all attacks are placed together. This amounts to having a priori knowledge about the distribution of normal and abnormal events in the audit stream.

In [Shyu et al., 2003], Shyu et al. used principal component analysis for anomaly detection over the KDD 99 data sets. However, during their experiments, they only consider the attributes that are numerical. This corresponds to using only 34 attributes; the remaining seven attributes (those that mainly characterize the R2L and U2R classes) are not taken into account. To perform their experiments, they only take some portion of the KDD 99 data sets. They claim they obtained good results, but this analysis is not correct because, when only some attributes are considered, many attacks cannot be detected. Their method [Shyu et al., 2003] can detect neither unknown attacks nor known attacks, particularly U2R and R2L attacks, since the second class of attributes is mainly designed for these two attack classes. These attributes are mostly discrete ones, as described in Chapter 4 (for more details on these attributes see Section 4.5.1). Moreover, their experiments are not realistic since only a small number of connection records are considered, where only some


attributes are considered. Which attacks were taken into account during their experiments is not stated in their paper; they considered 5 different experiments, in each of which they tested a set of 92,279 normal connections and 39,679 attack connections. Of course, if they had used only Probe and DoS attacks, this method would not be of much use, since any classification technique detects these attacks with very high success rates. The real challenge of the KDD 99 intrusion detection contest is the detection of the attacks that no algorithm used so far has detected with high success rates.

In the following, we present the different unsupervised techniques investigated by E. Eskin et al. in [Eskin et al., 2003]. We discuss these techniques, show why they are not of great interest for intrusion detection, and explain why the hypotheses they rely on are not always true in the real world, particularly for detecting new attacks. In their experiments, Eskin et al. [Eskin et al., 2003] considered three different algorithms, namely cluster based estimation, K-nearest neighbors, and one-class support vector machines (SVM) [Hearst, 1998]. The clustering methodology is well known and has been studied at length in many fields including statistics [Schnell, 1964], databases [Hinneburg & Keim, 1999, Zhang et al., 1996a, Guha et al., 1998, Wang et al., 1997], machine learning and visualization [Zhang et al., 1996b]. The authors assumed that the data set they analyze is very large, that most of its elements are normal, and that a few intrusions are buried within it. Their other assumption is that anomalies are qualitatively different from the normal instances. The basic idea behind their algorithms is that, since anomalies are both different from normal instances and rare, they appear as outliers in the data in which they are buried (see for instance Figure 5.3). The data they considered is KDD 99. The input data elements are mapped into a new feature space, typically a real vector space of some high dimension d. After mapping the input data into points of the feature space, unsupervised anomaly detection consists in detecting the points that are distant from most other points, or that lie in relatively sparse regions of the feature space. This makes the technique similar to the problem of outlier detection.

The goal of the first algorithm (cluster based estimation) is to estimate, for each point of the feature space, the density of points near it. To realize the approximation, the number of points within a sphere of radius ω around the point under consideration is counted (see for instance Figure 5.3). Points that are in a dense region of the feature space, with many points within their sphere, are considered normal, whereas points that are in a sparse region, with few points within their sphere, are considered anomalies. The fixed-width clustering algorithm works as follows. The first point is taken as the center of the first cluster. Every subsequent point that falls within the radius-ω sphere of an existing cluster is added to that cluster; otherwise it becomes the center of a new cluster. The complexity of this algorithm is therefore O(cn), where c is the number of clusters and n is the number of data points considered for analysis.
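As a concrete rendering of this single-pass algorithm, here is a minimal sketch; the function name and the choice of Euclidean distance are ours.

import math

def fixed_width_clusters(points, omega):
    """One pass over the data: a point joins the first cluster whose
    center lies within omega, otherwise it seeds a new cluster (O(cn))."""
    centers, counts = [], []
    for p in points:
        for i, center in enumerate(centers):
            if math.dist(p, center) <= omega:
                counts[i] += 1
                break
        else:
            centers.append(p)
            counts.append(1)
    return centers, counts

# Points falling in clusters with few members are candidate anomalies.
print(fixed_width_clusters([(0, 0), (0.1, 0.2), (5.0, 5.0)], omega=1.0))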
This algorithm greatly decreases the computation time compared to a pairwise comparison of all points. To test whether a point is anomalous, it suffices to compare the number of points in its cluster to a threshold set empirically. Outliers lie in less dense regions than normal samples, so the counts for points in dense regions will exceed the threshold (see for instance Figure 5.3).

The second algorithm detects anomalies by using the k nearest neighbors of each point: if the sum of the distances to the k nearest neighbors is greater than a threshold, the point is considered an anomaly. The complexity of a pairwise comparison between all the


considered points is O(n²), which is impractical when dealing with a large data set. To decrease the computation time, the authors sped up the algorithm using a technique similar to canopy clustering [McCallum et al., 1999]. Canopy clustering breaks the space down into smaller subsets so as to remove the need to check every data point. The data is first clustered using fixed-width clustering, which generates many different clusters; then, instead of computing the pairwise distances between all points, the algorithm takes advantage of these clusters and computes only the distances between points and the cluster centers generated during the first step. This strategy greatly decreases the computation time.

[Figure 5.3: Normal connections as dense regions and attacks as outliers; ω is the radius of the sphere drawn around each point.]

The last algorithm is a support vector machine based algorithm [Hearst, 1998] that identifies low support regions of a probability distribution by solving a convex optimization problem [Scholkopf et al., 1964]. The points of the feature space are further mapped into another space using a Gaussian kernel. In this second feature space, a hyperplane is drawn to separate the majority of the points away from the origin; the remaining points represent anomalies.

The different experiments were performed using the KDD 99 data and a portion of system call data from the BSM (Basic Security Module) part of DARPA 99 [DARPA 98 and DARPA 99, 2005]. We focus on the experiments performed over KDD 99. Since the unsupervised anomaly detection algorithms described above are sensitive to the ratio of anomalies in the data set, and the different KDD 99 data sets contain more intrusive instances than normal ones (see Table 4.1), the authors suggest filtering many attacks so that the resulting data set consists of 1% to 1.5% attack instances and 98.5% to 99% normal instances. At first sight, the results seem interesting: the best true detection rate varies from 93% to 98% and the false positive rate varies from 8% to 10% over the three algorithms above.
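Before turning to the shortcomings of these approaches, here is a brute-force sketch of the k-nearest-neighbor score described above; Eskin et al. avoid the O(n²) cost with the canopy-style shortcut, which we omit, and the names are ours.

import math

def knn_score(point, data, k):
    """Sum of distances to the k nearest neighbors; large scores flag
    points lying in sparse regions as anomalies."""
    dists = sorted(math.dist(point, other) for other in data if other != point)
    return sum(dists[:k])

data = [(0, 0), (0.2, 0.1), (0.1, 0.3), (4.0, 4.0)]
anomalies = [p for p in data if knn_score(p, data, k=2) > 2.0]
print(anomalies)   # the isolated point (4.0, 4.0)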


However, this technique presents many shortcomings. The idea of using an unsupervised algorithm over a large set of collected instances to detect anomalies is not realistic in real life. The first reason is that traffic volume varies over time: in an enterprise, for example, there is much more traffic during weekdays than during weekends, due to the presence of the staff. Moreover, if this technique were to be used in real time, it is unclear when detection should be launched: when the number of collected connections exceeds a certain threshold, or periodically? The set of collected connections may be too small or even empty, and the hypotheses of the three algorithms are certainly not always, if ever, true. In addition, if the system faces a denial of service, particularly a flooding attack, the corresponding connections will be detected as normal, since they are mapped into a dense region of the feature space. As an example, the slammer worm [Moore et al., 2003], which appeared after DARPA 98 and infected more than 100,000 MS-SQL servers in less than 10 minutes, is similar to flooding attacks and will never be detected by an unsupervised learner that assumes normal instances outnumber anomalous ones. Of course, clustering techniques perform well in other fields such as visualization, where the considered objects are static and do not differ considerably from one case to another. However, this is not the case in dynamic systems, where assumptions that hold in some cases are wrong in others. In addition, the authors [Eskin et al., 2003] did not state which types of attacks they considered in their experiments. As explained above, the authors assumed that anomalies are qualitatively different from the normal instances. However, in Section 5.4 we show that some attacks in the KDD 99 data sets have the same attribute values as normal traffic, because the MADAM/ID system does not follow some necessary conditions. These conditions should be taken into account when transforming raw data (here, network traffic) into highly exploitable data sets such as those of KDD 99.

5.1.3 Artificial anomalies for detecting unknown network intrusions

In [Fan et al., 2001], W. Fan et al. studied the problem of building detection models for both pure anomaly detection and combined misuse and anomaly detection. The proposed algorithm generates artificial anomalies to force the inductive learner to discover an accurate boundary between the known classes (normal connections and known intrusions) and anomalies. Empirical studies show that their pure anomaly detection model, trained using normal data and artificial anomalies, is able to detect more than 77% of all unknown intrusion classes with more than 50% accuracy per intrusion class. The combined detection models are as accurate as a pure misuse detection model in detecting known intrusions and are able to detect at least 50% of unknown intrusion classes with accuracy between 75% and 100%. The authors explored the use of traditional inductive learning algorithms for anomaly detection, focusing on the data sets. They presented a technique for generating artificial anomalies, based on the known classes, that coerces an arbitrary machine learning algorithm into learning hypotheses that separate all known classes from unknown ones. They discussed the generation of anomaly detection models from pure normal data, as well as the generation of combined misuse and anomaly detection models from data containing both known classes and anomalies, and applied the proposed approaches to network based intrusions. Their training data initially contains no anomaly examples.

The authors suggest using artificial anomaly generation for the task of pure anomaly detection. Artificial anomalies are injected into the training data to help the learner discover a boundary around the original data; all artificial anomalies are given the class label anomaly. The artificial anomaly


generation methods proposed are independent of the learning algorithm, as the anomalies are merely added to the training data. The boundary between known and anomalous instances is assumed to be very close to the existing data. The authors therefore introduced a heuristic approach that randomly changes the value of one feature of an example while leaving the other feature values unaltered. Some regions of the instance space are dense while others are sparse; sparse regions may be grouped with dense ones to produce large regions covered by overly general hypotheses. To prevent the hypotheses learned during training from being too general when predicting known classes, artificial anomalies are produced around the edges of the sparse regions, forcing the learning algorithm to discover the specific boundaries that distinguish these regions from the rest of the instance space. Sparse regions are characterized by feature values that are infrequent with respect to the whole data, so more artificial anomalies are generated around sparse regions, in proportion to their sparsity.

The proposed algorithm is based on the values of the less frequent attributes and works as follows. Assume that the value v of some feature is infrequent in the training data set. The difference between the number of occurrences of v, countV, and the number of occurrences of the most frequent value, countVmax, is calculated. Then (countVmax − countV) data points are randomly sampled from the training data. For each data point in this sampled set, the value vf of the considered feature f is replaced with a new value v′ such that v′ ≠ v and v′ ≠ vf, which generates a new anomaly. In order to guarantee that the generated anomalies differ from the instances of the training data set, two techniques may be used. The first consists in comparing each new sampled data item with all known instances of the training data set; this is a very expensive process. The second approach consists in using the hypotheses learned on the original data, as follows. The original data is merged with the new sampled data, and the induction algorithm is trained over the merged data set to generate the corresponding model. This model is tested over the artificial anomalies, and all the anomalies that are classified as known instances are removed from the new data set. This process is repeated until the size of the artificial anomaly set remains relatively stable.

To test their model, the authors used the training data set of the 10% KDD 99 [KDD Cup, 1999]: 80% of this data set is used for training and for generating the artificial anomalies, and the remaining 20% is used for testing. Two experiments were performed. The first is a pure anomaly experiment where only normal connections are considered for the training and for generating the artificial anomalies. The second experiment considers normal and intrusion instances simultaneously. The induction learning algorithm chosen for their experiments is RIPPER [Cohen, 1995], an inductive rule learner. The different results obtained using the small test data set are not very interesting, since there are many false alarms in some cases and many attacks remain undetected. In particular, the probing attacks (ipsweep, nmap, portsweep, satan) are detected with low detection rates (0%, 0%, 4.81%, 0.32% respectively).
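As an aside, the sparse-region amplification loop described above can be sketched as follows; the data representation and helper names are ours.

import random
from collections import Counter

def artificial_anomalies(data, feature):
    """data: list of dicts. For each infrequent value v of `feature`,
    sample (count_max - count_v) rows and replace their feature value
    with some v' different from both v and the row's own value."""
    counts = Counter(row[feature] for row in data)
    values = list(counts)
    count_max = max(counts.values())
    anomalies = []
    for v, c in counts.items():
        n = min(count_max - c, len(data))
        for row in random.sample(data, n):
            candidates = [u for u in values if u not in (v, row[feature])]
            if candidates:
                fake = dict(row)
                fake[feature] = random.choice(candidates)
                fake["class"] = "anomaly"
                anomalies.append(fake)
    return anomalies

data = ([{"protocol_type": "tcp", "class": "normal"}] * 5
        + [{"protocol_type": "udp", "class": "normal"}] * 2
        + [{"protocol_type": "icmp", "class": "normal"}])
print(len(artificial_anomalies(data, "protocol_type")))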
There are also many other U2R and R2L attacks that are not detected at all, such as spy and perl, and the remaining R2L and U2R attacks are mostly predicted with low detection rates (less than 10%). Another criticism of this method is that only the training data set is used for both the training and testing steps, whereas in the KDD 99 contest the badly classified R2L and U2R instances appear only in the test data set. No example from this test data set was used to evaluate their system against those attacks that always remain undetected.

Current supervised machine learning and classification techniques are not designed to


detect new classes that are not present in the training data set (in our case, new profiles that were not seen before). However, as mentioned above, several techniques have been proposed to detect anomalies, either by using unsupervised methods or by generating artificial anomalies. Unfortunately, some of these techniques rely on hypotheses that are not realistic enough to detect all new attacks, such as the EBayes TCP hypothesis that new attacks generate bursts of traffic. The unsupervised techniques, on the other hand, filtered the KDD 99 data sets so that the number of normal instances exceeds the number of anomalies; therefore, new attacks based on flooding and probing will never be detected. In addition, an attacker who knows about this technique may perform the same attack several times, so that it ends up assimilated to normal traffic. The last technique, based on generating artificial anomalies, does not reach high prediction rates and assumes that all traffic different from the normal one is anomalous, which is not always true. Moreover, since it considers only the training data set for both the learning and testing steps, it cannot explain why all the techniques applied to the KDD 99 test data set fail to detect new attacks of type U2R and R2L. In the following, we give our motivation for enhancing supervised techniques for anomaly intrusion detection.

5.2 Motivation and problem statement

Anomaly intrusion detection is the first intrusion detection method that was introduced to monitor computer systems, by Anderson [Anderson, 1980] in 1980, and then improved by Denning [Denning, 1987] in 1987. At that time, intrusion detection was immature since only user behavior and some system events were taken into account. This approach consisted in establishing a normal behavior profile for user and system activity and in observing significant deviations of the actual user activity with respect to the established habitual profile. Significant deviations are flagged as anomalous and should raise suspicion. This definition did not take into account the expert knowledge of known vulnerabilities, and hence of known attacks. This is why we enhance the notion of anomaly detection not only by considering normal profiles but also by taking into account abnormal behaviors extracted from known attacks. Since we have knowledge about known vulnerabilities and their corresponding attacks, we may enhance anomaly detection by adding the abnormal behavior corresponding to known attacks to the learning step. Anomaly detection then consists in learning all known normal and attack profiles (users, applications, system events, network traffic, etc.). Based on this learnt knowledge, anomaly detection has to determine whether a newly observed profile (a profile of a connection record, of network traffic, of a user, etc.) is normal or abnormal, in which case its corresponding known attack is identified, or whether the observed profile is new and therefore considered a novel unknown behavior. Thereafter, we suggest that a diagnosis be performed on the traffic that caused the detection of the new anomaly, in order to find out the reason for this new observation. If it corresponds to a new activity that was not seen before, it is flagged either as a normal profile or as a new attack, and the new observations with their real classification would then be considered for further investigation. We note that the diagnosis of newly observed behaviors is not our main objective here.

To our knowledge, all the efforts made by different researchers to detect new attacks in KDD 99 either consider unrealistic hypotheses or obtain uninteresting results. In the following, we overview the best entry methods of the KDD 99 contest. We recall the


different measures that are used to rank the methods proposed for this task, and we use these measures to assess our own results discussed in Sections 5.3 and 5.4.

The winning entry of the KDD 99 contest used decision trees. In fact, many different solutions and machine learning techniques may be used for the KDD 99 classification task: decision trees, Bayesian networks, artificial neural networks, K-nearest neighbors, or any other machine learning technique. We present here only the two methods that obtained the highest classification rates in the KDD 99 contest: the “Kernel Miner” of LLSoft [Levin, 2000] and the Quinlan decision trees [Pfahringer, 2000]. The “Kernel Miner” method [Levin, 2000] is based upon a global optimization model. This model constructs a set of locally optimal decision trees (a decision forest) from which it selects the optimal subset of trees (the subforest) used for predicting new cases. This method reached a successful prediction ratio of 92.92% over the whole database; however, the U2R class (resp. R2L class) is poorly predicted, with 11.84% (resp. 7.32%). On the other hand, Pfahringer [Pfahringer, 2000] used decision trees with the “bagging and boosting” options added by Quinlan [Quinlan, 1996] to the standard C4.5 decision tree algorithm. This algorithm is called C5 (or boosted and bagged C4.5). We do not use C5 because it is not an open source tool, which prevents us from making any modification or improvement. The bagging method constructs t trials from the original instances, where some instances of the training set do not appear in a given trial while others appear more than once; the final classifier is formed by aggregating the t classifiers built from these trials. The boosting technique maintains a weight for each instance at each iteration. At each trial, the vector of weights is adjusted to reflect the performance of the corresponding classifier, so that the weight of misclassified instances is increased. The final classifier also aggregates the learned classifiers by voting, but each classifier’s vote is proportional to its prediction accuracy. Using the above options, Pfahringer [Pfahringer, 2000] obtained a successful classification ratio of 92.70%; however, the successful classification ratio of the U2R class (resp. R2L class) remains low: 13.2% (resp. 8.4%).

To rank the different results, a cost matrix C is defined [Elkan, 2000]. Given the cost matrix illustrated in Table 5.1 and the confusion matrix obtained from an empirical testing process, a cost per test (CPT) is calculated using the formula given in Equation 5.1.

Actual \ Predicted   Normal   Probing   DoS   U2R   R2L
Normal                  0        1       2     2     2
Probing                 1        0       2     2     2
DoS                     2        1       0     2     2
U2R                     3        2       2     0     2
R2L                     4        2       2     2     0

Table 5.1: The cost per test matrix.

CPT = \frac{1}{N} \sum_{i=1}^{5} \sum_{j=1}^{5} C_{i,j} \cdot CM_{i,j}    (5.1)

where C is the cost matrix, N is the number of instances in the test data set, and CM is the confusion matrix obtained with the classification method under evaluation.
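Equation 5.1 transcribes directly into code; the following sketch assumes C and CM are given as 5×5 arrays ordered (Normal, Probing, DoS, U2R, R2L), and is not the contest's official scoring script.

def cost_per_test(C, CM, N):
    """Equation 5.1: average misclassification cost over the N test
    instances, where CM[i][j] counts actual-class-i instances
    predicted as class j."""
    return sum(C[i][j] * CM[i][j] for i in range(5) for j in range(5)) / N

# e.g. cost_per_test(cost_matrix, confusion_counts, N=311029) for the
# KDD 99 test set (60,593 + 4,166 + 229,853 + 228 + 16,189 instances).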


The confusion matrix obtained by Pfahringer [Pfahringer, 2000] is thus the winning entry with CPT = 0.2331, and the second entry is that of LLSoft [Levin, 2000] with CPT = 0.2356. The other techniques used for this task did not reach the ratios obtained by these two methods. To our knowledge, no objective explanation of the low successful prediction rates of the two classes U2R and R2L has been given in the literature. The different techniques applied to the KDD 99 data sets did not detect unknown attacks as new ones because the methods used for this task are not anomaly based: they consist in learning the signatures of the connections (attacks or normal traffic) using the 41 attributes composing the different connections, and new connections are then compared to the learning data set using the model constructed during the training phase. We call this technique misuse detection by learning, because it differs from other signature based techniques, such as snort, bro, etc., where only attack signatures are used and it is the role of the expert to write the different rules from the corresponding known vulnerabilities.

Based on the above considerations and limitations, and on the need to explain the failure of supervised techniques to detect U2R and R2L attacks, we investigate two supervised techniques, namely neural networks and decision trees, in order to explain the failure of machine learning techniques in the KDD 99 contest. In our experiments, we use the KDD 99 data sets without altering or adding any sample, as described in Table 4.1. The attacks that have at least one occurrence in the learning set should be detected as known attacks, and the others (those that are absent from the training set and present in the test set) are considered anomalies and should be predicted as new attacks. The default supervised algorithms do not deal with unknown classes. They are nevertheless interesting since they can generate alarms in real time at the end of a connection, by contrast with unsupervised techniques that remain unusable for real time intrusion detection. Moreover, separate modules for anomaly and misuse detection may not be as efficient as a single module performing both at the same time. These observations have motivated us to enhance supervised anomaly detection techniques. In the following, we investigate multilayer neural networks based on the backpropagation algorithm, and we enhance decision trees so as to take unknown classes into account, providing the different results obtained over the KDD 99 data sets. We investigate neural networks in order to compare their results with the best contest entries, and also as a cross validation of the decision trees for the task of detecting new attacks. While the successful detection rate of new U2R attacks is increased in our experiments, that of R2L attacks remains very low. This led us to focus on the transformation done by the MADAM/ID tool and to prove that this transformation is not appropriate: the low detection rate of new R2L attacks is due to this transformation and not to the enhanced machine learning algorithms, particularly the decision tree induction algorithm. In Chapter 4 we used the C4.5 [Quinlan, 1986] decision tree algorithm for misuse detection; here we use C4.5rules [Quinlan, 1986], a companion program to C4.5 that reduces the number of rules created from the decision tree during learning.
This algorithm is enhanced to cope with abnormal traffic, which in our case corresponds to real network traffic. For each of the two algorithms, we give the results obtained before and after the improvement.

5.3 Backpropagation technique for intrusion detection

5.3.1 Background

Backpropagation is a neural network learning algorithm. The field of neural networks was originally investigated by psychologists and neurobiologists who tried to develop artificial neurons by analogy with the human brain. A neural network is a set of connected units following a particular topology. Each neuron is described by a unit that has an input and an output; two neurons are connected if the output of one of them is connected to the input of the other, and each connection has a weight associated with it. The topology of the neural network, the training methodology for weight adjustment, and the connections between the different neurons define the type of the corresponding neural network. In our study, we are interested in multilayer neural networks using the backpropagation learning algorithm [Rumelhart et al., 1986]. In a multilayer neural network, there are three kinds of layers, each containing a set of neurons. The first layer, called the input layer, sets the activation of its neurons according to the input pattern in question. The output layer provides the answer of the network. A multilayer network may contain one or several hidden layers, although in practice usually only one is used. Figure 5.4 presents a multilayer neural network with one hidden layer.

[Figure 5.4: A backpropagation multilayer neural network: input neurons I0 … In, one hidden layer whose units j have inputs Ij, outputs Oj, weights wij and a bias bj (represented by a fictive neuron), and output neurons O0 … Om.]

Like a decision tree, a multilayer neural network has two phases: the learning phase, where the network learns by adjusting the weights so as to be able to predict the correct class label of new input patterns, and the test phase. Before the training process, one should define the number of hidden layers (if more than one) and the number of neurons on each layer. The number of neurons on the input layer corresponds to the number of attributes that represent a sample. However, the input values must be numerical for the backpropagation algorithm to be applied, so the discrete values are transformed into a vector as explained in Section 4.3.1.


Each distinct discrete value of an attribute is then assigned a neuron on the input layer. For example, for the protocol type (tcp, udp, icmp), three inputs, say I0, I1, I2, are assigned to this attribute. Each unit is initialized to 0; if the protocol type of the current connection is tcp (resp. udp, icmp), then I0 (resp. I1, I2) is set to 1. On the output layer, one output unit is used to represent exactly one class: if the output of a neuron on the output layer is equal to 1, then the corresponding class is designated as the predicted class. The number of hidden layers and the number of units on each hidden layer are established empirically during the training phase, since there are no clear rules for setting the best number of hidden units.

The backpropagation algorithm learns iteratively by processing a set of training samples, comparing the network’s prediction for each sample with the actual known class label. For each training sample, the weights are modified so as to minimize the mean squared error between the network’s prediction and the actual class. The modifications are made in the backwards direction, using a gradient descent method on the output error; in general the weights eventually converge and the learning process stops. The gradient descent updates each weight in proportion to a learning rate η. Another parameter added in practice is the momentum, which makes the weight update of the nth iteration depend partially on the update that occurred during the (n − 1)th iteration. In the first learning step, the weights are initialized to small random numbers (generally from −1.0 to 1.0 or from −0.5 to 0.5). Each unit also has an associated bias, which acts as a threshold in that it serves to vary the activity of the unit.

The use of neural networks in intrusion detection is not new: at least two systems were developed during the last decades. The first is used in Hyperview [Debar et al., 1992] for user behavior modeling, presented in Section 3.1.2. The second is discussed in [Cannady, 1998]; it was used as a misuse detection tool where only packet header attributes are analyzed, to detect denial of service and port scan attacks. While these works used neural networks for either user anomaly detection or misuse detection, we use them here for both network misuse and anomaly detection, particularly over the different KDD 99 data sets.
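As a concrete note on the input encoding and the weight update described above, here is a minimal sketch; the helper names are ours, and the η and α values are the ones retained later in Section 5.3.2.

import math

def one_hot(value, domain):
    """Encode a discrete attribute, e.g. protocol type over
    ("tcp", "udp", "icmp"), as a 0/1 input vector (I0, I1, I2)."""
    return [1.0 if value == v else 0.0 for v in domain]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def momentum_update(w, grad, prev_delta, eta=0.20, alpha=0.60):
    """Gradient descent step where the nth update depends partially on
    the (n-1)th one through the momentum term alpha."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

print(one_hot("tcp", ("tcp", "udp", "icmp")))   # [1.0, 0.0, 0.0]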

5.3.2 Experimental methodology and results

Some parameters of the neural network are known a priori from the problem at hand. The number of neurons on the input layer, in our experiments over the KDD 99 data sets, is equal to 125 units, because the discrete attributes among the 41 attributes are translated into continuous ones as in the experiments of Chapter 4. The number of neurons on the output layer is equal to the total number of classes, i.e. the five classes considered in the KDD 99 contest (normal, Probing, DoS, U2R and R2L). The other parameters, such as the number of hidden layers, the number of neurons on the hidden layers, the momentum, the learning rate and the number of iterations, are determined empirically. In the following, our architecture is limited to a single hidden layer: after tests with one, two and then three hidden layers, we did not obtain any significant improvement over a single hidden layer. The momentum is fixed to 0.60 after many experiments in which this parameter varied over the interval [0.20, 0.90], and the learning rate is fixed to 0.20 after varying it over the interval [0.10, 0.50]. The weight values of the different connections of the whole network are randomly initialized in the interval


[−0.50, 0.50]. Each hidden node and each output node applies a sigmoid transfer function, 1/(1 + e^{−x}), to its weighted input. Since each neuron on the output layer corresponds to one class, the neuron with the highest value defines the predicted class; using this technique, every sample is assigned one of the five classes defined a priori. Figure 5.5 presents the percentage of successful prediction obtained by varying the number of neurons on the hidden layer.

[Figure 5.5: PSP variation according to the number of neurons on the hidden layer.]

One rough guideline for choosing the number of hidden neurons is the geometric pyramid rule: for many practical networks, the number of neurons follows a pyramid shape, decreasing from the input layer toward the output layer, the number of neurons on each layer following a geometric progression. Other researchers [Berkeand & Hajela, 1991] suggested that the number of nodes on the hidden layer should lie between the average and the sum of the numbers of nodes on the input and output layers. These are only rough approximations of the ideal hidden layer size; the best approach, suggested by these authors, is to start with a few hidden neurons and to increase their number slightly until no significant improvement is noted. We therefore started our experiments with 2, 3, 4, etc. neurons on the hidden layer; the best result, obtained after many experiments, used 5 hidden neurons and reached a successful classification rate of 93.10%.

The gradient descent algorithm, in which the weights are updated at each epoch, may be terminated in two different ways. The first consists in fixing an error threshold err_min and continuing the learning until the error on the training examples falls below this threshold. This is a poor strategy, however, because backpropagation is susceptible to overfitting the training examples at the cost of decreasing the generalization accuracy over unseen examples; this technique may also lead the training into an infinite loop. The second method consists in fixing the number of iterations a priori. Figure 5.6 presents the percentage of successful prediction obtained by varying the number of iterations: starting with 10, 15, 20, . . . iterations, the best percentage of successful prediction was obtained with 25 iterations. Table 5.2 presents the confusion matrix related to the best percentage of successful prediction, obtained by combining the best parameters of the neural network. We note, from Table 5.2, that the prediction ratio PSP = 93.10% and the cost per test CPT = 0.2072 outperform all the results of previous work done over KDD 99. However, the U2R class is not detected at all in this result, because the number of U2R samples in the learning step is the lowest one (52 instances out of 494,021).


[Figure 5.6: PSP variation according to the considered number of iterations.]

Actual \ Predicted   %Normal   %Probing   %DoS   %U2R   %R2L
Normal (60,593)        97.87      0.75     1.20   0.00   0.18
Probing (4,166)        10.68     71.63    15.34   0.00   2.35
DoS (229,853)           2.62      0.36    97.00   0.00   0.02
U2R (228)              86.84      7.02     3.95   0.00   2.19
R2L (16,189)           73.20      0.06     0.06   0.00  26.68
PSP = 93.10%, CPT = 0.2072

Table 5.2: Confusion matrix when using the backpropagation technique with the best parameters.

Therefore it is difficult to learn this category using neural networks. Our goal is not to outperform previous work done over the KDD 99 intrusion detection contest; rather, we want to understand why all these algorithms fail to detect the last two attack classes, namely U2R and R2L. We also note that U2R and R2L attacks are most often detected as normal traffic (86.84% for U2R and 73.20% for R2L), as in almost all the techniques used for this task. Many R2L and U2R instances of the test data sets are new (see Table 4.1), since their attack type is not present in the learning data set. In order to detect these new attacks, a threshold θ is defined: if the value of the highest output neuron is below this threshold, the corresponding connection is momentarily considered anomalous, and a diagnosis should be performed for further investigation (the diagnosis is not a goal here). Figure 5.7 presents this algorithm. A similar idea is discussed in [Tombini et al., 2004], where the authors combined an anomaly IDS with a misuse IDS for HTTP traffic analysis.

IF all neurons on the output layer are less than a threshold THEN
    the corresponding connection is new;
    a diagnosis should be performed
ELSE
    let the kth neuron be the most activated one;
    this connection corresponds to the kth class
FI

Figure 5.7: Classification process using a threshold.
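A runnable rendering of the Figure 5.7 test, assuming outputs is the vector of output layer activations and classes lists the five labels in the same order; the names are ours.

def classify_with_threshold(outputs, classes, theta):
    """Return the most activated class, or 'new' when every output
    neuron falls below the threshold theta (Figure 5.7)."""
    k = max(range(len(outputs)), key=outputs.__getitem__)
    return "new" if outputs[k] < theta else classes[k]

print(classify_with_threshold([0.10, 0.35, 0.20, 0.05, 0.08],
                              ["normal", "probing", "dos", "u2r", "r2l"],
                              theta=0.70))   # -> 'new'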


Figure 5.8 shows the variation of the percentage of successful prediction versus the a priori fixed threshold θ.

[Figure 5.8: PSP variation according to the considered threshold value.]

The results shown in Figure 5.8 are obtained with the same neural network using the best parameters; attack instances that are predicted as new attacks are counted as successful predictions. While the overall successful prediction ratio increases with the threshold, the prediction ratio of each individual class decreases. Figure 5.9 shows the prediction ratio of each class when varying the threshold (the U2R class is not presented since it is not detected even before considering the threshold).

[Figure 5.9: PSP variation of the Normal, Probing, DoS, and R2L classes according to the considered threshold value.]

According to Figure 5.9, even if the threshold is set to 0.90, the two classes DoS and Normal remain detectable in their actual classes, which means that the neural network has correctly learned these two classes. However, the Probing and R2L classes are no longer predicted in their actual class once the threshold reaches 0.70 for R2L and 0.90 for Probing. This means that the test instances of these two classes are either not well learned or not close to their corresponding instances in the learning data set. On the other hand, the DoS and Normal classes remain detectable at a stable rate even with a threshold equal to 0.90.

[Figure 5.10: Ratios of each class (Normal, Probing, DoS, U2R, R2L) detected as a new class, according to the considered threshold value.]

We report in Figure 5.10 the prediction ratios of the different classes that are detected as new ones. Figure 5.10 shows that, as the threshold increases, the two attack classes R2L and Probing are increasingly detected as new classes: they move from their actual class, predicted when no threshold was considered, to a new class, as if they differed from their real class. Figure 5.11, in turn, presents the ratios of the different attack classes that are detected as the normal class while the threshold value increases. It is interesting to note that the prediction ratio of these attacks as normal remains stable for all of them, even when the threshold is equal to 0.90. The two classes U2R and R2L are always detected as normal, with rates exceeding 76.75% for U2R and 64.4% for R2L. This means that the new instances of these two classes are seemingly close to the normal connection instances present in the training set.

[Figure 5.11: Ratios of the different attack classes (Probing, DoS, U2R, R2L) detected as the normal class, according to the considered threshold value.]

Although the neural network outperforms all previous work done over the KDD 99 intrusion detection data sets, it fails to detect the attacks that are present in low numbers in the training data set, particularly the U2R attack instances, which are almost never predicted across the whole set of experiments. Adding a threshold to the test algorithm slightly improves the successful prediction ratio, but the attacks that are predicted as the normal class remain classified as normal traffic even when the threshold is high. This means that these attack classes are too close to the normal connection records. While neural networks transform the discrete values of the different attributes into numerical values, the decision tree algorithm works not only with numerical attribute values but also with discrete ones. In Section 5.4, we therefore investigate the decision tree induction algorithm to test whether it is possible to detect new attacks, especially the two classes R2L and U2R that remain undetected. Our goal is to detect these last two classes as attacks rather than to improve the overall successful prediction ratio over the whole test data; if this is not possible, we should give the reasons why they are always detected as normal connections.

5.4 Improving the decision trees for intrusion detection

5.4.1 Background

In Section 4.3.2, we investigated the decision tree induction algorithm C4.5 and used it for misuse detection; the corresponding results are discussed in Section 4.6.2. In practice, however, one successful method for finding high accuracy hypotheses is based on pruning the rules issued from the tree constructed during the learning phase. This method is used in C4.5rules [Quinlan, 1986], a companion program to C4.5 that creates rule sets by post-processing decision trees. C4.5rules begins by constructing a rule from each path to a leaf node, where each attribute test in the path becomes a conjunct in the


rule. The result is potentially a large number of rules, but the initial rule set is mutually exclusive. C4.5rules then examines the rules, testing each conjunct to determine whether it is necessary: if rule accuracy is unaffected, the conjunct is deleted. After deleting conjuncts, the resulting rule set is neither mutually exclusive nor exhaustive, so C4.5rules performs several final steps to improve it. Finally, it groups the rules by class, based on the number of false positive errors committed by each class subgroup. After constructing the tree using the C4.5 algorithm, the rule post-pruning used in C4.5rules involves the following steps [Mitchell, 1997]:

1. convert the learned tree into an equivalent set of rules by creating one rule for each path from the root to a leaf node,

2. prune each rule by removing any conjunct whose removal improves its estimated accuracy,

3. sort the pruned rules by their estimated accuracy, and consider them in this order for further classification.

After the building process of the learning step, each attribute test along the path from the root to the leaf becomes a rule antecedent (precondition) and the classification at the leaf node becomes the rule consequent (postcondition). To illustrate rule post-pruning, consider the following rule generated from the tree:

IF (protocol type = icmp) ∧ (count > 87) THEN class = smurf

Each such rule is pruned by removing any antecedent whose removal does not worsen its estimated accuracy. For the above rule, rule post-pruning would consider removing the antecedents (protocol type = icmp) and (count > 87). It would select the first antecedent as a first pruning step, then compute the estimated rule accuracy after this removal to check that it does not decrease; it would then consider the second precondition as a further pruning step, and so on. No pruning step is performed if it reduces the estimated rule accuracy. C4.5rules computes its estimate as the rule accuracy over the training examples to which the rule applies [Quinlan, 1986]. There are many advantages in converting the decision tree to rules before pruning [Mitchell, 1997]:

• It allows distinguishing among the different contexts in which a decision node is used. Since each distinct path through the decision tree produces a distinct rule, the pruning decision regarding an attribute test can be made differently for each path. In contrast, if the tree itself were pruned, the only two choices would be to remove the decision node completely or to retain it in its original form.

• It removes the distinction between attribute tests that occur near the root of the tree and those that occur near the leaves. We thus avoid messy bookkeeping issues, such as how to reorganize the tree if the root node is pruned while part of the subtree below this test is retained.

• It improves readability for humans.

In addition to the above advantages cited by Mitchell [Mitchell, 1997], the pruned rules have several advantages for intrusion detection. Since the rules have the "IF ... THEN ..." format, they can be used as a model for rule based intrusion detection. The different


C4.5 rules generated are concise and intuitive, so they can be checked and inspected by a security expert for further investigation. C4.5rules also has interesting properties for intrusion detection, since it provides good generalization accuracy: new intrusions whose forms are quite similar to the known attacks considered during the building process may appear later, and thanks to the generalization accuracy of the rules, such attack variations can be detected. Real time IDSs require short rules for efficiency; post-pruning the rules generates accurate, compact conditions and hence improves the execution time when decision trees are used in real time intrusion detection. New instances are classified by testing the different rules against their attribute values until a match is found, whose consequent is designated as the class of the new instance.
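To summarize the post-pruning loop described in this section, here is a greedy sketch; the rule representation (a list of test callables plus a class label) and the accuracy estimate over covered training rows are ours.

def rule_accuracy(conjuncts, label, rows):
    """Fraction of covered training rows whose class matches the rule."""
    covered = [r for r in rows if all(test(r) for test in conjuncts)]
    return sum(r["class"] == label for r in covered) / len(covered) if covered else 0.0

def post_prune(conjuncts, label, rows):
    """Greedily drop any conjunct whose removal does not decrease the
    estimated accuracy, until no further removal helps."""
    pruned = True
    while pruned and len(conjuncts) > 1:
        pruned = False
        base = rule_accuracy(conjuncts, label, rows)
        for c in list(conjuncts):
            trial = [t for t in conjuncts if t is not c]
            if rule_accuracy(trial, label, rows) >= base:
                conjuncts, pruned = trial, True
                break
    return conjuncts

rule = [lambda r: r["protocol_type"] == "icmp", lambda r: r["count"] > 87]
rows = [{"protocol_type": "icmp", "count": 90, "class": "smurf"},
        {"protocol_type": "icmp", "count": 10, "class": "smurf"}]
print(len(post_prune(rule, "smurf", rows)))   # one conjunct is dropped here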

5.4.2 Improving the classification process

While the rules are efficient for detecting intrusions and their variants, they remain limited to known attacks and normal traffic. This is because the C4.5 decision tree algorithm written by Quinlan [Quinlan, 1986] presents a drawback for the instances that are not covered by any of the rules generated from the decision tree: it assigns a default class to those instances. The default class is defined as the one with the most items not covered by any rule; in case of conflict, ties are resolved in favor of the most frequent class. An example of such a classification is illustrated in Table 5.3.

C4.5 rule: duration <= 2, num failed logins > 5 -> class guess_passwd
Meaning: If the duration of the connection is less than or equal to 2 seconds and the number of failed logins is greater than 5, then this connection (telnet or rsh) is a password guessing attack.

C4.5 rule: protocol type = icmp, src bytes > 333 -> class smurf
Meaning: If the protocol type is icmp and the length of the packets coming from the source is greater than 333, then this connection is a smurf attack.

...

Default: Normal
Meaning: If none of the rules matches, then the current connection corresponds to a normal one.

Table 5.3: Classification using the post-pruned rules.

Using this principle, a default class from the learning data set is assigned to any observed instance, which may be a normal connection, a known attack or an unknown attack. This solution is interesting if we know that all classes are known a priori and there is no new class different from those that are known a priori. This kind of classification is useful only if it is exclusive; i.e. there is a class for any given instance and the assigned class has at least one instance in the learning data set. Since we are interested in detecting novel attacks, this kind of classification would not be able to detect new attacks, which are normally not covered by any rule of the tree built during the learning step. To overcome this problem, instances that do not have a corresponding class in the training data set are assigned to a default class denoted new class. Therefore, if a new instance does not match any of the rules generated by the decision tree, then this instance is classified as a new class instead of being assigned to a default class.


Let us call this algorithm the enhanced C4.5. To illustrate the effectiveness of this new classification, we conduct, in Section 5.4.3, our experiments on the KDD99 database, since its test data set contains many new attacks that are not present in the training data set, as shown in Table 4.1. On the other hand, we applied this technique to real traffic in our laboratory network. This traffic contains some new attacks that were not available when DARPA98 was built, such as the Slammer worm and different DDoS attacks. These experiments are presented in Section 5.6. This proposal may be generalized to any problem similar to the KDD99 contest that seeks to find new instances in the test data set, where some classes should be detected as new ones rather than as one of the categories listed in the training data set. The fact that new attacks are not considered is one of the reasons that prevented the different methods applied to the KDD99 contest from predicting any new attack.
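A minimal sketch of this enhanced classification step follows; the rule representation (a list of conjunct predicates with a class label) is an assumption of this illustration.

    NEW_CLASS = "new_class"

    def matches(rule, instance):
        # A rule matches when every one of its conjuncts holds on the instance.
        return all(conjunct(instance) for conjunct in rule.conjuncts)

    def classify(instance, rules):
        # Rules are assumed sorted by decreasing estimated accuracy.
        for rule in rules:
            if matches(rule, instance):
                return rule.label
        return NEW_CLASS   # enhanced C4.5: no default (majority) class here

The only change with respect to the standard C4.5rules classification is the last line, where the majority default class is replaced by the new class.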

5.4.3 Experimental Analysis of KDD99

We present the different experiments and results obtained when using the different rules generated from the standard C4.5 algorithm. With this algorithm, a default class among the known classes of the training data set is automatically assigned to any new instance that is not covered by any of the rules. In the second step, the enhanced C4.5 algorithm, as explained in Section 5.4.2, is used to handle new instances. The accuracy of each experiment is based on the percentage of successful prediction (PSP) on the test data set, as presented in Equation (4.6). We also report for each test the cost per test (CPT), whose value is given in Equation (5.1). Table 5.4 presents the confusion matrix for the 5 classes when using the rules from the decision trees generated by the standard C4.5rules algorithm of Quinlan [Quinlan, 1993].

Actual \ Predicted    %Normal  %Probing    %DoS   %U2R   %R2L
Normal (60,593)         99.47      0.40    0.12   0.01   0.00
Probing (4,166)         18.24     72.73    2.45   0.00   6.58
DoS (229,853)            2.62      0.06   97.14   0.00   0.18
U2R (228)               82.89      4.39    0.44   7.02   5.26
R2L (16,189)            81.60     14.85    0.00   0.70   2.85
PSP = 92.30%, CPT = 0.2342

Table 5.4: Confusion matrix obtained using the rules generated by the standard C4.5rules algorithm.
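Both figures reported below each confusion matrix can be recomputed from the raw predictions. The following minimal sketch assumes that labels are encoded as indices over the classes Normal, Probing, DoS, U2R and R2L, and uses the cost matrix published with the KDD99 contest results [Elkan, 2000]:

    # Cost matrix COST[actual][predicted], class order:
    # Normal, Probing, DoS, U2R, R2L (from [Elkan, 2000]).
    COST = [[0, 1, 2, 2, 2],
            [1, 0, 2, 2, 2],
            [2, 1, 0, 2, 2],
            [3, 2, 2, 0, 2],
            [4, 2, 2, 2, 0]]

    def psp_and_cpt(actual, predicted):
        # PSP: fraction of correctly predicted instances;
        # CPT: total misclassification cost divided by the number of instances.
        n = len(actual)
        correct = sum(a == p for a, p in zip(actual, predicted))
        cost = sum(COST[a][p] for a, p in zip(actual, predicted))
        return correct / n, cost / n

For the enhanced algorithm, a sixth column can be appended to COST for the new class; as stated below, we take this column to be null.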

From Table 5.4, the two classes R2L and U2R are badly predicted. On the other hand, many Probing and DoS instances are misclassified within the normal category. Most misclassified instances are predicted as normal. This is due to the supervised C4.5rules algorithm, which assigns a default class among the known classes, as explained in Section 5.4.2. We note that the class with the highest number of instances not covered by the different pruned rules in the learning data set is the normal class, corresponding to the normal traffic. Hence, if a new instance is presented that is different (see Definition 5.4.1 below) from all other known normal or abnormal instances of the learning step, it is automatically classified as the default class normal, since this class has the highest number of uncovered instances.


Definition 5.4.1. An instance A is different from all other instances present in the training data set, according to the different generated rules, if none of the rules matches this instance.

The confusion matrix obtained when we use the enhanced C4.5rules algorithm, which assigns uncovered instances to the new class, is presented in Table 5.5.

Actual \ Predicted    %Normal  %Probing    %DoS   %U2R   %R2L   %New
Normal (60,593)         99.43      0.40    0.12   0.01   0.00   0.04
Probing (4,166)          8.19     72.73    2.45   0.00   6.58  10.06
DoS (229,853)            2.26      0.06   97.14   0.00   0.18   0.36
U2R (228)               21.93      4.39    0.44   7.02   5.26  60.96
R2L (16,189)            79.41     14.85    0.00   0.70   2.85   2.20
PSP = (92.30 + 0.57)%, CPT = 0.2228

Table 5.5: Confusion matrix obtained when using the rules generated from the enhanced C4.5 algorithm.

By using the enhanced C4.5 algorithm, the detection rate of the U2R class is increased by 60.96% (corresponding to the httptunnel attack), which decreases the false negative rate of this class from 82.89% (189/228) to 21.93% (50/228). The detection rate of the Probing class is also enhanced by 10.06%, corresponding to 413 instances that are no longer classified as normal traffic but as a new class. We note that the different ratios presented in Table 5.5 are the same as those in Table 5.4, except in the normal column, where the corresponding ratios have decreased from Table 5.4 to Table 5.5. This is expected, since the normal class is the default class when using the standard C4.5 algorithm, whereas in the second experiment all the instances that would have received the default class are classified in the new class. We should mention that the highest detection ratio for the U2R class never exceeded 14% in the different results available in the literature. Using our approach, this attack class is detected as abnormal traffic with a detection rate of 67.98%. The false positive rate is increased by a small ratio, corresponding to 24 instances (0.04%). However, the false negative rate of the R2L class remains stable. In addition, even if we count the detection ratio of the new instances that are classified as new attacks, the PSP ratio (92.30% + 0.57% = 92.87%) remains far from 100%. On the other hand, the cost per test obtained by our method is much better than Pfahringer's winning entry [Pfahringer, 2000], achieving a CPT of 0.2228 (we consider that the last column of the cost matrix, which corresponds to the new attack class, is null). To our knowledge, no work in the literature has exceeded Pfahringer's [Pfahringer, 2000] winning entry. Of course, the neural networks discussed in Section 5.3 provided the highest PSP ratio and the best cost per test value, but the best results produced using neural networks are obtained only after many experiments to find the best parameters, and the U2R attack class remains undetectable using neural networks. In the following experiments, we use in the first step the standard C4.5 algorithm. The bad classification of some new instances, described in Table 4.1, is expected, since there is always a known class to which such a new instance is assigned.


In the pessimistic case, when none of the rules fits the new instance because it differs from the instances of the learning phase, it is classified in the default category. The normal class is the default class in the first experiment, whereas in the second experiment the enhanced C4.5 algorithm classifies new instances, which are different from those in the training data set, into a new class that was not considered a priori. While the false negative ratio of the U2R class decreased, the false negative ratio of the R2L attacks remains stable. In the following paragraphs, we investigate in depth the classification of the different U2R and R2L attacks (as described in Table 4.1). This investigation is necessary to show the robustness of the enhanced C4.5 algorithm.

Investigating experiments and discussions over KDD99

In the following, a comparative study between the confusion matrices obtained in two different tests is presented. In the first test we use the default training data set of KDD99 as the training data set, and in the second test we use the test data set as the training set. In each test, we examine the percentage of successful prediction (PSP) using the learning data set of each test as a test set. The objective of this analysis is to help us discover whether the two data sets (learning and test data sets) are incoherent. The different prediction ratios may thus help us find out whether the enhanced C4.5 algorithm we proposed is inefficient or whether the different KDD99 data sets present some anomalies, such as incoherence.

Definition 5.4.2. A database is said to be coherent if all the training instances characterized by the same attribute values belong to the same class. It is said to be incoherent if there are at least two instances having the same attribute values but different classes.

Test 1—The learning data set coherence

Let us now examine the matrix shown in Table 5.6, corresponding to the confusion matrix obtained from testing the enhanced C4.5 algorithm with the training data set used both as the learning and the testing data set.

Actual \ Predicted    %Normal  %Probing    %DoS   %U2R   %R2L   %New
Normal (97,278)         99.94      0.01    0.00   0.00   0.00   0.05
Probing (4,107)          0.17     99.78    0.00   0.00   0.00   0.05
DoS (391,458)            0.00      0.00   99.99   0.00   0.00   0.01
U2R (52)                 1.92      1.92    0.00  90.39   0.00   5.77
R2L (1,126)              0.62      0.00    0.00   0.09  98.93   0.36
PSP = 99.99%

Table 5.6: Confusion matrix obtained using the enhanced C4.5 algorithm on the initial KDD99 learning database.

We notice that the different classes are predicted with high rates when the learning database is used to construct the tree and to generate the different rules. The successful prediction ratio is PSP = 99.99%. However, the lowest prediction ratio is that of the U2R class, because there are not enough instances (52) of this class in the learning set. The enhanced C4.5 algorithm has proven its capacity to classify the least frequent classes, which are not covered by any of the rules generated by the decision tree algorithm, as novel attacks rather than as normal traffic.
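Definition 5.4.2 can be checked mechanically. The following minimal sketch assumes that a data set is given as a list of (attribute vector, class label) pairs:

    from collections import defaultdict

    def is_coherent(dataset):
        # Coherent iff no two instances share the same attribute values
        # while carrying different class labels (Definition 5.4.2).
        labels_seen = defaultdict(set)
        for attributes, label in dataset:
            labels_seen[tuple(attributes)].add(label)
        return all(len(labels) == 1 for labels in labels_seen.values())

Applied to the KDD99 test data set, pairs of records such as those of Table 5.10 below make this predicate false.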


We note that the results provided by the standard C4.5 algorithm may be obtained from the enhanced C4.5 by adding the column of the new class to the column of the default class. The normal class is the default class unless otherwise specified. In the field of supervised machine learning techniques, a method is said to be powerful if it learns and predicts the different instances of the training set with a low error and then generalizes its knowledge to predict the class of new instances. Unfortunately, while the C4.5 induction algorithm has efficiently learned the different instances of the training set, according to Table 5.6, it could not classify new instances into their appropriate category, according to the poor results reported in Table 5.4. The confusion matrix presented in Table 5.4 shows that the two classes U2R and R2L are badly classified within the normal class. We expected this result, because the standard C4.5 is not designed to detect novel classes that are not present in the training set. We have improved this algorithm to handle these new instances, but the R2L class, as shown in Table 5.5, remains badly classified. Hence, two cases are possible: either the enhanced C4.5 algorithm failed to detect these new attacks, or some KDD99 data are incoherent. If the second case is true, then these data sets should be analyzed to verify their correctness. The first assumption is not possible, because if a new instance is totally distinct from all the instances present in the training data set, it is classified as a new one: no rule issued from the decision tree can classify it, and the default rule of the enhanced algorithm classifies it as a new instance. Otherwise, it is not totally distinct and then certainly belongs to a known class. In the following, we examine in detail the classification of the new instances belonging to the R2L class presented in Table 4.1, namely {named, sendmail, snmpgetattack, snmpguess, worm, xlock, xsnoop}. Table 5.7 presents the confusion matrix corresponding to these new R2L attacks in the test data set.

Actual \ Predicted       %Normal  %Probing   %DoS   %U2R   %R2L   %New
named (17)                 70.59      0.00   0.00   0.00   0.00  29.41
sendmail (17)             100.00      0.00   0.00   0.00   0.00   0.00
snmpgetattack (7,741)     100.00      0.00   0.00   0.00   0.00   0.00
snmpguess (2,406)          99.88      0.04   0.00   0.00   0.00   0.08
worm (2)                  100.00      0.00   0.00   0.00   0.00   0.00
xlock (9)                 100.00      0.00   0.00   0.00   0.00   0.00
xsnoop (4)                 50.00      0.00   0.00  25.00  25.00   0.00
PSP ≈ 0.00% (resp. ≈ 0.00% for the standard C4.5)

Table 5.7: Confusion matrix relative to the new R2L attacks using the enhanced C4.5 algorithm.

From Table 5.7, there is only one instance of type xsnoop that is properly classified as an R2L attack, another one that is classified in the U2R class, and one instance of type snmpguess that is classified as a probing attack; these results are common to the two algorithms, standard C4.5 and enhanced C4.5. In addition, only two instances of type snmpguess and five instances of type named are classified as new attacks. All the remaining instances of the new R2L attacks are predicted as normal connections, i.e. 10,186 (resp. 10,193) using the enhanced C4.5 algorithm (resp. the standard C4.5 algorithm). The false negative rate of the new R2L attacks present in the test data set is about 99.10% (resp. 99.97%) for the enhanced C4.5 algorithm (resp. the standard C4.5 algorithm).


These results show that, after the transformation performed by MADAM/ID, these new R2L connections are not distinct from the normal connections. In the following paragraph, we investigate the coherence of the test database by considering it as a learning database. We test whether the different rules issued from this database can predict the instances of this same database; that is to say, whether this database is coherent or incoherent. We then further discuss the distinction between the normal connections and the R2L attack connections.

Test 2—The test data set incoherence

In the second test, we invert the two databases. Hence, the learning database consists of 311,029 connections and the test database contains 494,021 connections. Using the standard and the enhanced C4.5 algorithms, we obtained the confusion matrix presented in Table 5.8.

Actual \ Predicted    %Normal  %Probing    %DoS   %U2R   %R2L   %New
Normal (60,593)         98.34      0.02    0.03   0.01   1.50   0.11
Probing (4,166)          0.19     99.35    0.07   0.00   0.00   0.38
DoS (229,853)            0.01      0.00   99.99   0.00   0.00   0.00
U2R (228)                2.19      0.00    0.00  96.93   0.00   0.88
R2L (16,189)            36.40      0.02    0.01   0.05  63.33   0.19
PSP = 97.70%

Table 5.8: Confusion matrix relative to the five classes using the rules generated by the enhanced C4.5 algorithm over the learning database of the second test.

Although the percentage of successful prediction from confusion matrix 5.8 is PSP = 97.70%, it must be considered very low, since this experiment merely consists in classifying the labeled known instances of the learning data set itself. Such a rate is considered very low in the machine learning domain because the classifier could not learn instances whose classes are known a priori. This means that the C4.5 algorithm failed to learn the instances with their appropriate labels. In particular, the R2L class is highly misclassified: the classifier has learned only 63.33% of all the R2L labeled instances, and most misclassified R2L instances are predicted as normal connections. This result confirms our observation stated in the first test: after transformation, the new R2L attacks are not distinct from the normal connections. Table 5.9 presents the confusion matrix of the different R2L attacks of the second test learning data set (311,029 connections, corresponding to the initial KDD99 test database) using the rules issued from learning this same database. The snmpgetattack type is the most frequent class type present in the R2L category (7,741/16,189). The decision rules generated from the decision tree constructed from the second database could not classify 71.85% of the snmpgetattack instances, corresponding to 5,562 instances; this represents a high false negative rate. Hence, this data set (i.e. the KDD99 test data set) is incoherent. Table 5.10 shows the similarity between the different attributes of a normal connection and a snmpgetattack connection. As a consequence, C4.5 assigns the rules corresponding to such connections to the most frequent class.

Actual \ Predicted       %Normal  %Probing   %DoS   %U2R   %R2L   %New
ftp write (3)              33.33      0.00   0.00   0.00  33.33  33.33
guess passwd (4,367)        6.92      0.02   0.02   0.00  93.02   0.02
imap (1)                  100.00      0.00   0.00   0.00   0.00   0.00
multihop (18)              11.11      0.00   0.00  22.22  38.89  27.78
named (17)                 29.41     11.76   0.00  11.76  41.18   5.88
phf (2)                   100.00      0.00   0.00   0.00   0.00   0.00
sendmail (17)              41.18      0.00   0.00   0.00  58.82   0.00
snmpgetattack (7,741)      71.85      0.00   0.00   0.00  27.96   0.19
snmpguess (2,406)           0.17      0.00   0.00   0.00  99.83   0.00
warezmaster (1,602)         0.19      0.00   0.00   0.00  99.38   0.44
worm (2)                  100.00      0.00   0.00   0.00   0.00   0.00
xlock (9)                  22.22      0.00   0.00   0.09  66.67  11.11
xsnoop (4)                 25.00      0.00   0.00  25.00   0.00  50.00
PSP = 63.32%

Table 5.9: Confusion matrix relative to the R2L attacks using the enhanced C4.5 algorithm (second test).

0,udp,snmp,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00, 0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,snmpgetattack. 0,udp,snmp,SF,105,146,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0.00,0.00,0.00,0.00,1.00, 0.00,0.00,255,254,1.00,0.01,0.00,0.00,0.00,0.00,0.00,0.00,normal. Table 5.10: snmpgetattack attack and normal connection similarity.

According to Table 5.8, 1.5% (909) of the normal instances are predicted as R2L attacks. On the other hand, almost all snmpguess attacks (99.38%) are detected as R2L attacks during the second test. This is because the instances of this attack outnumber the normal connections that have the same attribute values as those of snmpguess in the test data set. Since many attack connections and many normal connections are similar, the question one has to ask is why different attacks have the same attributes as those of the normal connections: is the corresponding tcpdump traffic of the different attacks actually similar to that of normal connections, or is the transformation done over these data sets incorrect? In the following, we investigate the two attacks snmpguess and snmpgetattack and show why they are similar to the normal traffic. These two new attacks, which are present only during the two test weeks, correspond in reality to an attack scenario [Kendall, 1999]. In this scenario, an attacker guesses the SNMP [Case et al., 1990] community password and then remotely monitors router activity. The SNMP password is set to "public" by default, and is often never changed from this default value. In the DARPA98 data sets, the SNMP community password remains the default "public". On the first day of the first test week, an attack against the SNMP community password of an internal router started at 9:00:08.


The attacker sent SNMP requests to that router using different consecutive passwords until receiving a response indicating that the password was correct. This attack is similar to a dictionary attack for password guessing. We should mention that there were more than 30,000 SNMP requests in the DARPA98 tcpdump traffic to find the correct password. This attack corresponds to snmpguess, which is considered as an R2L attack presenting 26.75% (4,367/16,189) of the connections in the R2L class of the 10% KDD99 test data set. Once the attacker has guessed the password, he may easily monitor the router without being detected. Moreover, this attacker came back many times to monitor this community during the two test weeks by using the guessed password. The attacker's monitoring traffic corresponds to the R2L attack snmpgetattack in the KDD99 database, which presents 47.82% (7,741/16,189) of the whole set of R2L connections in this test database. All instances of snmpgetattack are predicted as normal (see Table 5.7). This result is expected and corresponds exactly to the situation presented in point 2.b in Section 5.5. Indeed, this traffic is recognized as normal because the attacker logs in as if he were a non-malicious user, since he has guessed the password. However, the snmpguess category should be recognized as a new attack or as a dictionary attack. Unfortunately, there is no attribute among the 41 attributes that tests the SNMP community password in the SNMP request, as is the case with the attributes that verify whether a root or a guest password is used; the latter are considered only for the telnet, rlogin, etc., services. Hence, some interesting information, with which we might have distinguished the traffic generated by the snmpguess attack from the normal traffic, is lost after transformation. This situation corresponds exactly to the necessary condition that a transformation function T should satisfy (see point 2.a in Section 5.5). Therefore, this transformation function is poor, and some attributes should be added to differentiate dictionary-based attacks from other traffic. While in Section 4.5 (Chapter 4) we assumed the feature construction done over the tcpdump traffic to be free of errors and conducted our experiments over the KDD99 data sets generated by the MADAM/ID tool, we find that many connections of different classes have exactly the same attributes. Therefore, it is difficult for any machine learning technique to differentiate between these classes. In the following section, we show that KDD99 is not the result of an appropriate transformation; that is, the tool that generated this data set is not accurate and does not generate coherent data sets.
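As a purely illustrative example (this attribute is not among the 41 KDD99 features, and the record fields used below are assumptions of this sketch), a feature exposing the snmpguess behavior could count the distinct community strings tried by one source against one destination within a short time window:

    def distinct_snmp_communities(connections, src, dst, t_now, window=2.0):
        # Hypothetical feature: number of distinct SNMP community strings
        # sent from src to dst during the last `window` seconds. A dictionary
        # attack such as snmpguess drives this value far above 1, whereas
        # legitimate monitoring traffic keeps it at 1.
        return len({c.community for c in connections
                    if c.src == src and c.dst == dst
                    and c.service == "snmp"
                    and t_now - c.timestamp <= window})

Such an attribute would have preserved exactly the information that the MADAM/ID transformation loses.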

5.5 Why KDD99 is not an appropriate transformation?

We recall that the intrusion detection database KDD99 is the result of a transformation of tcpdump traffic into connection records, performed by the MADAM/ID tool [Lee, 1999] presented in Section 4.5.1. However, we should mention that the transformation done in MADAM/ID [Lee, 1999] presents some drawbacks, due to limitations of the tools used for this task and to the lack of some basic definitions and necessary conditions that must be satisfied by this transformation. In the following, we introduce some definitions and conditions that a good transformation should verify in order not to lose meaningful information from the initial form of the data. The transformation task may be formalized as follows. Let R be the raw data set collected from the network traffic, or from other sources, depending on the environment we are interested in analyzing to discover known or new computer security attacks. We can formalize audit data preprocessing by a transformation function T from the raw data set R to a well featured data item set I. This last data set denotes the whole set of possible values of the different considered features.


An item x of I is a vector of the form (v1, v2, ..., vn), where each value vi is either discrete or continuous. Let C = {c1, c2, ..., cm} be the set of the different known classes to which a behavior (an item) may belong. The classification function, which we denote F, is then used to assign a class label to an input item vector.

What is an appropriate transformation function?

1. The transformation model, which consists in transforming the raw data set into the corresponding items of I, should be rich enough to distinguish between the different behaviors in the new feature space after transformation. A poor transformation model T may occur when some attribute values are the same in different data items that have different class labels. This means that if we consider ri, rj ∈ R with T(ri) = xi and T(rj) = xj, where xi, xj ∈ I, then if F(ri) ≠ F(rj) the transformations of ri and rj should be different, i.e. xi ≠ xj. If this is not the case, then the data items that share the same attribute values but have different class labels are considered as noise data, and their number must be reduced so that accurate classification models may be learned from I.

2. If two items T(ri) = xi and T(rj) = xj, issued from a transformation T, have two distinct classes but similar values for all considered attributes, then two cases are possible (both cases can be told apart automatically; see the sketch after this list):

(a) The set of attributes issued from the transformation T is not sufficient to characterize, and thus differentiate, the two items; i.e. the transformation function T is poor. We should then add new attributes that render this transformation rich, and the problem is resolved. More formally, let real class(ri) be the real class of ri. If ri ≠ rj, real class(ri) ≠ real class(rj) and T(ri) = T(rj), then the function T is poor. When this case occurs, the corresponding records present an incoherence with the real traffic. Therefore, the number of attributes, which is not sufficient, should be increased so as to distinguish the two distinct records in the new feature space.

(b) We cannot distinguish between the raw traffic of the two connections ri and rj having two distinct classes. In this case we cannot find any transformation function T that may distinguish the two connection forms in the new feature space. More formally: if ri = rj and real class(ri) ≠ real class(rj), then there exists no T such that T(ri) ≠ T(rj). This last case is possible if we consider a subject b that knows the password of another subject a. The data generated by the intruder b using the account of the user a, during a telnet authentication session for example, would not be different from the data generated by the legitimate user a. In this situation, no intrusion detection method can find this intrusion without using additional information. A method based on masquerade detection is possible, by analyzing the behavior of this intruder against the learned behavior of the legitimate user a, but this is not the main goal here. In addition, detecting the first login steps of an intruder who knows the password of a legitimate user is not possible without any a priori knowledge.
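When the raw records and their real classes are available, the two cases can be distinguished automatically. The following minimal sketch assumes a transformation function T, a labeling function real_class, and hashable raw records:

    from collections import defaultdict

    def audit_transformation(raw_records, T, real_class):
        # Group the raw records by their transformed item and inspect
        # every group whose records carry more than one real class.
        groups = defaultdict(list)
        for r in raw_records:
            groups[T(r)].append(r)
        poor, undecidable = [], []
        for item, records in groups.items():
            if len({real_class(r) for r in records}) > 1:
                if len(set(records)) > 1:
                    poor.append(item)         # case 2.a: enrich T with new attributes
                else:
                    undecidable.append(item)  # case 2.b: no T can separate them
        return poor, undecidable

Items reported in the first list call for additional attributes; items in the second list cannot be separated by any transformation.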


In KDD99, there are many attacks that do not satisfy the condition presented in point 2.a above, such as snmpguess, which presents many occurrences and is similar to the normal connections. Many other connections remain undetectable because a password has been stolen a priori, as pointed out in item 2.b. Unfortunately, the snmpgetattack attack is present with a high number of occurrences in the test data set and is perfectly similar to the normal snmp traffic, as described in Table 5.10.

5.6 Other experiments of new attacks detection

We have performed some experiments to verify the efficiency of the enhanced C4.5 algorithm in detecting new attacks that were not present when DARPA98 was built. The transformation and the different programs (MADAM/ID [Lee, 1999]) developed at Columbia University are not available (these programs are licensed to a company that is now developing them commercially). We therefore developed programs that transform the network traffic into connection records while respecting the different rules and conditions that should be taken into account, as explained in Section 5.5. The new attacks we investigate are the flooding attacks generated by the known DDoS tools, such as Trinoo, TFN, TFN2K, etc., used during the year 2000 against many servers over the Internet. The second attack category is the Slammer [Moore et al., 2003] DoS worm, which infected thousands of vulnerable servers over the Internet in 2003.

1. The distributed denial of service (DDoS) attacks consist in putting out of service any logical or physical resource in a computer system. In general, the attacker using flooding DDoS attacks compromises and recruits thousands of agent machines that are used to flood the desired victim. The recruited agents, also called slaves, each launch thousands of legitimate-looking packets against the desired victim. These tools may launch flooding attacks such as syn flooding, udp flooding or smurfing. We used our transformation tool in real time; it feeds the enhanced C4.5 anomaly tool, which predicts the class of any new connection as described in the previous sections. The 10% KDD99 database is used as the training data set. In this experiment, we obtained a success rate of 100% over all the traffic launched by the different DDoS tools against a victim: this traffic was predicted as a DoS attack. This ratio was expected, since DARPA98 contains a variety of DoS attack types. Actually, all known flooding attacks are present in KDD99: smurf, neptune (syn flooding), etc.

2. The Slammer worm [Moore et al., 2003] infected more than 100,000 MS-SQL servers over the Internet in less than 10 minutes. The vulnerability was known at that time, but the different MS-SQL users, including Microsoft, had not patched their software. An attacker sends a UDP packet to port UDP/1434, corresponding to the MS-SQL service. A vulnerable MS-SQL server that receives this packet sends thousands of the same packets to many other machines, whose IP addresses are randomly generated, thus targeting many other MS-SQL servers over the Internet. For our tests, we remotely attacked a vulnerable machine running MS-SQL. This machine then sent thousands of UDP/1434 packets outside our network. Our tool monitors all the traffic outgoing from this machine, since it is placed in the same LAN, and transforms it into connection records that are classified in real time using the enhanced C4.5 anomaly detection tool. All the traffic generated by this machine was classified as a


DoS attack with a prediction ratio of 100%. We expected the traffic generated by this attack to be classified as a new attack, but it was classified as a DoS attack, because many connections in DARPA98 also resemble this attack. For instance, a single rule, corresponding to the neptune attack (syn flooding), classified all the connections of the Slammer traffic as a DoS attack. Had our tool been available at that time, we could have stopped the Slammer worm by launching a counter measure [Cuppens et al., 2006], such as blocking the traffic generated by the vulnerable machine by adding a rule in the firewall connecting this machine to the Internet.

We tried to detect the new DDoS and Slammer attacks, which were not known when DARPA98 was constructed, as new attacks. Fortunately, they were classified as DoS attacks: in reality, the form of the traffic generated by the DDoS agents is not different from that of the DoS traffic already present in the DARPA98 database. We mention that no signature based IDS could detect the flooding DDoS traffic without using further statistical measures, as snort does with its preprocessors [Snort NIDS, 2005]. On the other hand, a signature based IDS can detect the traffic generated by the Slammer worm, but only after appropriate signatures have been added to its database. Here lies the advantage of anomaly IDSs, which can categorize new attacks into their appropriate classes without any a priori knowledge of the packet content needed to write the corresponding signature. To adaptively learn new attacks, we have improved our method as follows. If a new connection is detected as new or as a known attack, we add its corresponding connection record to the learning database, provided no similar connection already exists in the learning set. We then redo the learning step with these new attacks present in the learning database. This idea allows the C4.5 classifier to learn the new attacks in an incremental fashion. Meanwhile, the new connections that are classified in the new category (see for the moment Table 5.5, particularly the new U2R attacks detected as new traffic) are temporarily considered abnormal so that an appropriate countermeasure can be launched.
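A minimal sketch of this incremental loop follows; the helpers similar_exists and train_c45, as well as the labeling of the detected record, are assumptions of this illustration:

    def incremental_step(record, predicted, learning_set, rules):
        # When a connection is flagged as a known attack or as new_class,
        # add its record to the learning database (unless a similar record
        # is already present) and redo the learning step.
        if predicted == "normal":
            return rules
        if not similar_exists(record, learning_set):
            learning_set.append(record)      # labeled before insertion
            rules = train_c45(learning_set)  # rebuild the tree and rules
        return rules

Rebuilding the tree after each insertion is the simple approach sketched here; batching several new records before retraining would reduce the cost of the learning step.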

5.7 Summary

In this chapter, we presented related works on anomaly intrusion detection and criticized them, since they can be used only when certain hypotheses are satisfied. We then investigated two different techniques for anomaly intrusion detection, namely neural networks and decision trees. These two techniques fail to detect new attacks that are not present in the training data set. We slightly improved them for anomaly intrusion detection and tested them over the KDD99 data sets and over real network traffic in real time. While the neural networks are very interesting for generalization but very poor at new attack detection, the decision trees have proven their efficiency in both generalization and new attack detection. The results obtained with these two techniques outperform the winning entry of the KDD99 intrusion detection contest. Another interesting contribution is the introduction of the new class to which new instances should be assigned when supervised machine learning techniques are used for anomaly intrusion detection. Since the different MADAM/ID programs [Lee, 1999] are not available and present many shortcomings, we wrote the different programs that transform tcpdump traffic into connection records. The objective of our contribution in this chapter is threefold. It first consists in extending the notion of anomaly intrusion detection by considering both normal and known intrusions during the learning step. The second is the necessity to improve machine learning methods by adding a new class into which novel instances should be classified, since they should not be classified as any of the known classes present in the learning data set.


The third contribution consists in introducing some necessary conditions that should be verified by a rich transformation function. This last point was not taken into account during the transformation of the DARPA98 data into the KDD99 data sets; as a result, many attack traces became identical to normal traffic after transformation. We notice that the nearest neighbor approach may also be adapted for anomaly detection by adding a threshold that is tested in order to verify whether the current connection is known or not, as is done in the backpropagation technique. However, this technique has a complexity that does not permit its use in real time, which is why we did not investigate this direction.

Chapter 6

Conclusion

This chapter summarizes the thesis and presents some future work that may be investigated, based on the different theoretical and experimental results obtained and on the different directions we explored in both anomaly and misuse detection.

6.1 Overview

We first presented some mechanisms that are useful for computer security and showed their limitations. They are basically based on access control. Although the access control field has been studied at length and has attracted many researchers for decades, it remains limited, since a lot of attacks bypass the different access control mechanisms. Intrusion detection is a second barrier that may be coupled with access control to enforce computer security. While many works have been done on intrusion detection, much remains to be improved to make current security mechanisms as resistant as possible. Afterwards, an overview of the different techniques developed in intrusion detection during the last three decades was presented. Even if there are two different approaches to intrusion detection, namely misuse and anomaly detection, none of them has succeeded in resolving all the security problems a computer system may face. Anderson and Denning introduced anomaly detection during the eighties. They set some bases for anomaly detection without giving any theoretical and formal foundations for this field, which was immature at that time. Research in intrusion detection continued on this issue. To our knowledge, almost all the research done on the challenging problem of intrusion detection is not based on a formal design. In fact, intrusion detection techniques have no formal reasoning behind them; as a matter of fact, only tests are used to verify whether the techniques work or fail. We gave a background of many detection mechanisms that have been developed or studied by the intrusion detection community. We criticized some of them, gave our point of view, and improved some others. We introduced a method based on principal component analysis for anomaly intrusion detection for both user behavior and network traffic analysis. We then showed its effectiveness for space reduction and used it for network traffic. The most comprehensive evaluations of IDSs reported to date are the 1998 and 1999 [DARPA 98 and DARPA 99, 2005] offline evaluations performed by the Lincoln Laboratory at MIT; this research was funded by DARPA. Some knowledge discovery and knowledge engineering techniques were used in [Lee, 1999] to transform the different tcpdump traffic into connection records, which are made available in


[Hettich & Bay, 1999]. The objective of this transformation consists in preparing the data to be used by classification techniques. While many different approaches applied to this transformed data set failed to detect some kinds of attacks, particularly unknown attacks, none of the different techniques has explained the cause of its failure. Moreover, they only compare their techniques to others and express their comparison using the detection ratios obtained in each experiment. Since we first assumed these data sets to be error free, we also used some classification techniques and improved others for anomaly detection. After experimenting with the suggested techniques, we found that there was little improvement in the successful prediction ratios and that some new attacks remained undetectable. This situation led us to assess the different data sets. We found out, after proposing some necessary transformation conditions, that much interesting information is lost during the transformation done in [Lee, 1999]. Therefore, future research should not use this data set for further investigation without filtering the different data that present incoherence.

6.2 Thesis contributions

We summarize the different contributions of our thesis in the following points:
• Principal Component Analysis for anomaly detection. We introduced a novel anomaly intrusion detection method based on principal component analysis. We thoroughly defined the different steps that are necessary for applying this technique over user profiles. We then used it for network traffic, where it showed its robustness for high speed networks.
• Principal Component Analysis for space reduction. We studied the problem of space reduction, which is important in current networks where many new protocols are emerging; analyzing thousands of features will certainly become a great problem. Of course, we experimented the reduction technique on only about four dozen attributes, but it remains valid for thousands of attributes.
• Combining reduction techniques with machine learning approaches. We combined the reduction technique based on principal component analysis with different machine learning techniques for misuse detection. We experimented this combination and found that the different results remain stable while the space and the time consumed are reduced.
• Enhancing machine learning techniques for anomaly detection. We modified some supervised techniques to handle new attacks. In particular, we enhanced decision trees for anomaly detection. We also tested a neural network approach for both anomaly and misuse detection. However, we preferred decision trees because they provide rules that may be assessed by experts to check whether they are effective or not; in addition, one may add rules to those provided automatically by the decision tree induction algorithm. The nearest neighbor and especially the neural network approaches, on the contrary, do not explain why they perform well. Moreover, we cannot forecast the results that may be obtained by a neural network, since the different parameters are adjusted at each experiment to obtain good results.
• Implementing a tool for extracting features from network traffic. Since the tool called MADAM/ID [Lee, 1999] is not available, we implemented a new tool that extracts the different features considered in KDD99 for intrusion detection evaluation. This engineering task helped us transform the different traffic in real time.


It also permitted us to evaluate our classification on new attacks that were not known before the construction of the KDD99 data sets.

6.3 Future work and open issues

There are many future directions in which the work done in this thesis may be extended:
• User anomaly detection. We have studied a technique for anomaly detection where the user profile is considered as the main subject. Our data sets were very simple and restrictive, since we only assessed our method on commands issued in a UNIX-like environment. The other experiment uses the different web pages that are visited by some users during a session. These data sets permitted us to assess the effectiveness and the robustness of the proposed technique. However, we think that further experiments dealing with other data sets should be conducted to investigate this technique further.
• Network anomaly detection. We studied one technique for anomaly detection over network traffic. This technique is performed over a database that was severely criticized on one hand, and which was only tested on one simulated environment on the other hand. Therefore, the results may not be as good for practical use as they were in the different experiments done on the different simulated data sets. Although some new attacks were predicted successfully, this does not mean that all new attacks would be detected, because the different attributes used for evaluation do not capture the whole raw traffic. Much more research should be performed in this direction to find an approach that may summarize all the needed traffic without information loss. The method we proposed for detecting new attacks, like those of other researchers (see Section 5.1), only uses the DARPA data sets. This is very restrictive, since it is not obvious how to detect new attacks in a new environment with new applications that were not considered during the learning step. We therefore propose to always build a learning database where all known intrusions and the normal traffic of the considered environment are taken into account.
• Network anomaly diagnosis. After a new anomaly is detected, one may run a diagnosis process to find out the reasons for this anomaly. Existing anomaly detection presents, however, a serious drawback: it provides a way to detect an abnormal behavior, but no diagnosis of the cause of the anomaly. Hence, we must add a complementary analysis of the reported anomalies, particularly in order to diagnose the presence of a new form of attack (whose signature could then be re-injected in a classical signature-based IDS). To our knowledge, there are very few works on this aspect. As a next step after detecting a new anomaly, one should try to build the underlying attack scenario. Every approach based on anomaly detection raises alarms when abnormal behavior is detected, but does not provide the security administrator with accurate information about the detected intrusion. This is a drawback of anomaly detection, since this kind of information is needed to elaborate the appropriate counter-measures. A future work consists in providing the security administrator with complementary information, in particular about the localization of the anomaly, its type and its possible causes. The objectives one can follow here are to study several model-based diagnosis techniques and to adapt them to the intrusion detection context.


• Hybrid correlation technique. Most of the correlation techniques use only alerts from signature based IDSs. As future work, we are investigating the use of anomaly intrusion detection techniques for correlation, so that the different alerts generated by the technique we suggested in Chapter 5 can be correlated with those of misuse detection tools. It may be integrated with the different available tools, such as CRIM [Cuppens, 2001a] or CARDS [Yang et al., 2000], and any other explicit or semi-explicit correlation tool. Since these tools do not deal with unknown attacks, we are currently investigating their extension so that the new attacks reported by the anomaly detection can be handled and integrated into the ongoing correlation attack scenarios.
• Anticipating counter measures. When we use connection records, the detection process is performed after the connection has finished. However, one interesting idea consists in detecting the intrusion before the end of the connection, in order to prevent the harmful effects of the attacks. One possible direction consists in using a predictive process to estimate the different values of a connection record before the connection finishes. This is a challenging problem that we have started to study in order to anticipate the detection process. Possibilistic induction of decision trees [Borgelt et al., 1996] is the first method we plan to implement in the near future to anticipate the detection process.
• Building formal bases for intrusion detection. There is currently no formal design and basis in intrusion detection. The only way to assess an intrusion detection technique consists in testing it over a data set; whether a technique fails or works is not known a priori. If a technique fails, one cannot tell whether the problem lies in the technique or in the lack of data. This situation should be changed, for example by introducing some specification mechanisms over the different audit trails generated by the different network and operating system services.

6.4 Thesis Summary

This thesis introduced a new technique based on principal component analysis for intrusion detection. It also studied the problem of space reduction, which is particularly necessary for the currently emerging protocols and for high speed networks. The other objective of this thesis consists in improving machine learning techniques for detecting known and new attacks on-line.

Bibliography

[Abbes et al., 2004] Abbes, T., Bouhoula, A., & Rusinowitch, M. (2004). Protocol analysis in intrusion detection using decision trees. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'2004).
[AbouElKalam et al., 2003] AbouElKalam, A., Baida, R. E., Balbiani, P., Benferhat, S., Cuppens, F., Deswarte, Y., Miège, A., Saurel, C., & Trouessin, G. (2003). Organization Based Access Control. In Proceedings of the IEEE 4th International Workshop on Policies for Distributed Systems and Networks (POLICY 2003) (pp. 120–134). Lake Como, Italy.
[Anderson, 1980] Anderson, J. P. (1980). Computer Security Threat Monitoring and Surveillance. Technical report, James P. Anderson Co., Fort Washington, Pennsylvania.
[Anderson, 1974] Anderson, T. W. (1974). An Introduction to Multivariate Statistical Analysis. New York, NY: John Wiley and Sons.
[Arfken, 1985] Arfken, G. (1985). Lagrange Multipliers, §17.6. In Mathematical Methods for Physicists. Orlando, FL: Academic Press.
[Asaka et al., 1999] Asaka, M., Taguchi, M., & Goto, S. (1999). The Implementation of IDA: an Intrusion Detection Agent System. In Eleventh Annual FIRST Conference on Computer Security Incident Handling and Response (FIRST'99).
[Autrel & Cuppens, 2005] Autrel, F. & Cuppens, F. (2005). Using an intrusion detection alert similarity operator to aggregate and fuse alerts. In Proceedings of Sécurité et Architecture des Réseaux (SAR'2005).
[Badger et al., 1995] Badger, L., Sterne, D. F., Sherman, D. L., & Walker, K. M. (1995). Practical Domain and Type Enforcement for UNIX. In Proceedings of the IEEE Symposium on Security and Privacy (pp. 66–77).
[Badger et al., 1996] Badger, L., Sterne, D. F., Sherman, D. L., & Walker, K. M. (1996). Confining Root Programs with Domain and Type Enforcement (DTE). In 6th USENIX UNIX Security Symposium, San Jose, CA.
[Bay, 1998] Bay, S. D. (1998). Combining Nearest Neighbor Classifiers Through Multiple Feature Subsets. In Proceedings of the 15th International Conference on Machine Learning (pp. 37–45). San Francisco, CA: Morgan Kaufmann.
[Bell & LaPadula, 1976] Bell, D. E. & LaPadula, L. J. (1976). Secure Computer Systems: Unified Exposition and Multics Interpretation. Technical Report ESD-TR-75-306, MTR-2997, Rev. 1, MITRE Corporation, Bedford, MA.
[BenAmor et al., 2004] BenAmor, N., Benferhat, S., & ElOuedi, Z. (2004). Naive Bayes vs Decision Trees in Intrusion Detection Systems. In The 19th ACM Symposium on Applied Computing (SAC 2004), Nicosia, Cyprus.
[Berkeand & Hajela, 1991] Berkeand, L. & Hajela, P. (1991). Application of neural nets in structural optimisation. NATO/AGARD Advanced Study Institute, 23(I-II), 731–745.
[Biba, 1975] Biba, K. J. (1975). Integrity considerations for secure computer systems. Technical Report MTR-3153.


[Boebert & Kain, 1985] Boebert, W. E. & Kain, R. Y. (1985). A Practical Alternative to Hierarchical Integrity Policies. In 8th National Computer Security Conference (pp. 18–27). Gaithersburg, MD.
[Borgelt et al., 1996] Borgelt, C., Gebhardt, J., & Kruse, R. (1996). Concepts for Probabilistic and Possibilistic Induction of Decision Trees on Real World Data. In Proceedings of the 4th European Congress on Intelligent Techniques and Soft Computing (EUFIT'96), Aachen, Germany.
[Bouzida & Cuppens, 2006] Bouzida, Y. & Cuppens, F. (2006). Detecting Known and Novel Network Intrusions. In 21st IFIP International Information Security Conference (SEC'2006) (pp. 258–270). Karlstad, Sweden: Springer Publishers.
[Bouzida et al., 2006] Bouzida, Y., Cuppens, F., & Gombault, S. (2006). Detecting and Reacting Against Distributed Denial of Service Attacks. In IEEE International Conference on Communications (ICC'2006).
[Bouzida & Gombault, 2003a] Bouzida, Y. & Gombault, S. (2003a). An Efficient Method to Intrusion Detection. In Proceedings of the 1st International Conference on Sciences of Electronic, Technologies of Information and Telecommunications (SETIT 2003), Sousse, Tunisia.
[Bouzida & Gombault, 2003b] Bouzida, Y. & Gombault, S. (2003b). Intrusion Detection Using Principal Component Analysis. In Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida.
[Bouzida & Gombault, 2004] Bouzida, Y. & Gombault, S. (2004). Eigenconnections to Intrusion Detection. In 19th IFIP International Information Security Conference (SEC'2004) (pp. 241–258). Toulouse, France: Kluwer Academic Publishers.
[Breiman et al., 1984] Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees.
[C2BSM, 1991] C2BSM (1991). Sun Microsystems, Inc. Installing, Administering, and Using the Basic Security Module. 2550 Garcia Ave., Mountain View, CA 94043.
[Cannady, 1998] Cannady, J. (1998). Artificial Neural Networks for Misuse Detection. In Proceedings of the 1998 National Information Systems Security Conference (NISSC'98), Arlington, VA, USA.
[Case et al., 1990] Case, J., Fedor, M., Schoffstall, M., & Davin, J. (1990). Simple Network Management Protocol (SNMP). Available at: http://www.ietf.org/rfc/rfc1157.txt.
[Center, 1987] National Computer Security Center (1987). A Guide to Understanding Discretionary Access Control in Trusted Systems.
[CERT, 1999] CERT (1999). Results of the distributed-systems intruder tools workshop. Available at: http://www.cert.org/reports/dsit_workshop.pdf.
[CERT Coordination Center, 2005] CERT Coordination Center (2005). Computer Emergency Response Team-CERT. Available at: http://www.cert.org.
[Cohen, 1995] Cohen, W. (1995). Fast effective rule induction. In Twelfth International Conference on Machine Learning (ICML'1995) (pp. 115–123). Morgan Kaufmann.
[Coit et al., 2001] Coit, C. J., Staniford, S., & McAlerney, J. (2001). Towards Faster String Matching for Intrusion Detection or Exceeding the Speed of Snort. Technical report, Silicon Defense.
[Computer-Associates, 2000] Computer-Associates (2000). E-Trust Intrusion Detection. Available at: http://www.ca.com.
[Connor & Atlas, 1991] Connor, J. & Atlas, L. E. (1991). Recurrent Neural Networks and Time Series Prediction. In International Joint Conference on Neural Networks (IJCNN'91) (pp. 301–306). Seattle, WA.
[Cuppens, 2001a] Cuppens, F. (2001a). Cooperative intrusion detection. In International Symposium on Information Superiority: Tools for Crisis and Conflict-Management, Paris, France.
[Cuppens, 2001b] Cuppens, F. (2001b). Managing Alerts in a Multi-Intrusion Detection Environment. In 17th Annual Computer Security Applications Conference, New Orleans, USA.


[Cuppens et al., 2006] Cuppens, F., Autrel, F., Bouzida, Y., Garcia, J., Gombault, S., & Sans, T. (2006). Anti-correlation as a criterion to select appropriate counter-measures in an intrusion detection framework. Annales des télécommunications, 61(1-2), 197–217.
[Cuppens et al., 2002] Cuppens, F., Autrel, F., Miège, A., & Benferhat, S. (2002). Recognizing malicious intention in an intrusion detection process. In Second International Conference on Hybrid Intelligent Systems (HIS'2002) (pp. 806–817). Santiago, Chile.
[Cuppens & Cuppens, 2005] Cuppens, F. & Cuppens, N. (2005). High Level Conflict Management in the Or-BAC model. Technical report, ENST Bretagne.
[Cuppens et al., 2004] Cuppens, F., Gombault, S., & Sans, T. (2004). Selecting Appropriate Counter-Measures in an Intrusion Detection Framework. In 17th IEEE Computer Security Foundations Workshop (pp. 78–87). Pacific Grove, CA: IEEE Computer Society.
[Cuppens & Miège, 2002] Cuppens, F. & Miège, A. (2002). Alert Correlation in a Cooperative Intrusion Detection Framework. In IEEE Symposium on Security and Privacy, Oakland, USA.
[Cuppens & Miège, 2003a] Cuppens, F. & Miège, A. (2003a). Administration Model for Or-BAC. In International Federated Conferences (OTM'03), Workshop on Metadata for Security (pp. 754–768). Catania, Sicily, Italy.
[Cuppens & Miège, 2003b] Cuppens, F. & Miège, A. (2003b). Modelling contexts in the Or-BAC model. In Proceedings of the 19th Annual Computer Security Applications Conference (ACSAC 2003) (pp. 416–427). Las Vegas, Nevada, USA.
[Cuppens & Ortalo, 2000] Cuppens, F. & Ortalo, R. (2000). LAMBDA: A Language to Model a Database for Detection of Attacks. In Third International Workshop on the Recent Advances in Intrusion Detection (RAID'2000), Toulouse, France.
[DARPA 98 and DARPA 99, 2005] DARPA 98 and DARPA 99 (2005). DARPA Intrusion Detection Evaluation. Available at: http://www.ll.mit.edu/IST/ideval/data/dataindex.html.
[Dasarathy, 1991] Dasarathy, B. V. (1991). A Computational Demand Optimization Aide for Nearest-Neighbor-Based Decision Systems. In IEEE International Conference on Systems, Man and Cybernetics (pp. 1777–1782). Morgan Kaufmann.
[Debar et al., 1992] Debar, H., Becker, M., & Siboni, D. (1992). A neural network component for an intrusion detection system. In Proceedings of the 1992 IEEE Symposium on Research in Computer Security and Privacy, Oakland, CA.
[Debar et al., 2005] Debar, H., Curry, D., & Feinstein, B. (2005). The Intrusion Detection Message Exchange Format. Internet Draft. Available at: http://www.ietf.org/internet-drafts/draft-ietf-idwg-idmef-xml-14.txt.
[Debar et al., 1998] Debar, H., Dacier, M., & Wespi, A. (1998). Fixed vs. Variable-Length Patterns for Detecting Suspicious Process Behavior. In Proceedings of the 5th European Symposium on Research in Computer Security (pp. 1–15). Louvain-la-Neuve, Belgium.
[Debar & Wespi, 2001] Debar, H. & Wespi, A. (2001). Aggregation and Correlation of Intrusion-Detection Alerts. In Fourth International Workshop on the Recent Advances in Intrusion Detection (RAID'2001) (pp. 87–105). Davis, USA.
[Denning, 1987] Denning, D. (1987). An Intrusion Detection Model. IEEE Transactions on Software Engineering, 13(2), 222–232.
[Denning & Neumann, 1985] Denning, D. E. & Neumann, P. G. (1985). Requirements and model for IDES, a real time intrusion detection expert system. Technical report, Computer Science Laboratory, SRI International.
[Dousson, 1994] Dousson, C. (1994). Suivi d'Evolutions et Reconnaissance de Chroniques. PhD thesis.
[Elkan, 2000] Elkan, C. (2000). Results of the KDD’99 Classifier Learning. SIGKDD Explorations, 1, 63–64.

[Emran & Ye, 2001] Emran, S. M. & Ye, N. (2001). Robustness of Canberra Metric in Computer Intrusion Detection. In Proceedings of the 2001 IEEE Workshop on Information Assurance and Security (pp. 80–84). United States Military Academy, West Point, NY.
[Eskin et al., 2003] Eskin, E., Arnold, A., Prerau, M., Portnoy, L., & Stolfo, S. (2003). A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. In Applications of Data Mining in Computer Security. Kluwer.
[Eskin et al., 2001] Eskin, E., Lee, W., & Stolfo, S. J. (2001). Modeling System Calls for Intrusion Detection with Dynamic Window Sizes. In DARPA Information Survivability Conference and Exposition II (DISCEX II), Anaheim, CA.
[Fan et al., 2001] Fan, W., Miller, M., Stolfo, S. J., Lee, W., & Chan, P. K. (2001). Using Artificial Anomalies to Detect Unknown and Known Network Intrusions. In Proceedings of the 2001 IEEE International Conference on Data Mining (pp. 123–130). San Jose, CA, USA.
[Fan et al., 2004] Fan, W., Miller, M., Stolfo, S. J., Lee, W., & Chan, P. K. (2004). Using artificial anomalies to detect unknown and known network intrusions. Knowledge and Information Systems, 6(5), 507–527.
[Fayyad et al., 1996a] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996a). From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press.
[Fayyad et al., 1996b] Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996b). The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, 39(11), 27–34.
[Fayyad, 1991] Fayyad, U. M. (1991). On the Induction of Decision Trees for Multiple Concept Learning. PhD Thesis.
[Ferraiolo & Kuhn, 1992] Ferraiolo, D. F. & Kuhn, R. (1992). Role-Based Access Controls. In Z. Ruthberg & W. Polk (Eds.), Proceedings of the 15th NIST-NSA National Computer Security Conference (pp. 554–563). Baltimore, MD.
[Fisk & Varghese, 2001] Fisk, M. & Varghese, G. (2001). Applying Fast String Matching to Intrusion Detection. Technical Report CS2001-0670, Los Alamos National Laboratory.
[Fix & Hodges, 1951] Fix, E. & Hodges, J. L. (1951). Discriminatory Analysis: Nonparametric Discrimination: Consistency Properties. Technical Report 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas.
[Forrest et al., 1996] Forrest, S., Hofmeyr, S. A., Somayaji, A., & Longstaff, T. A. (1996). A Sense of Self for Unix Processes. In IEEE Symposium on Security and Privacy, Oakland, USA.
[Garcia et al., 2004] Garcia, J., Autrel, F., Borrel, J., Bouzida, Y., Castillo, S., Cuppens, F., & Navarro, G. (2004). Preventing Coordinated Attacks via Alert Correlation. In Proceedings of the Ninth Nordic Workshop on Secure IT Systems Encouraging Cooperation (NORDSEC’2004).
[Goldberg, 1991] Goldberg, D. E. (1991). Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley.
[Guha et al., 1998] Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An Efficient Clustering Algorithm for Large Databases. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data.
[Han & Kamber, 2001] Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
[Heady et al., 1990] Heady, R., Luger, G., Maccabe, A., & Servilla, M. (1990). An Architecture of a Network Level Intrusion Detection System. Technical report, Department of Computer Science, University of New Mexico.
[Hearst, 1998] Hearst, M. A. (1998). Support Vector Machines. IEEE Intelligent Systems, 13(4), 18–28.
[Hettich & Bay, 1999] Hettich, S. & Bay, S. D. (1999). The UCI KDD Archive. Available at: http://kdd.ics.uci.edu/.

[Hinneburg & Keim, 1999] Hinneburg, A. & Keim, D. A. (1999). Clustering Methods for Large Databases: From the Past to the Future. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (pp. 80–84). Philadelphia, PA.
[Hochberg et al., 1993] Hochberg, J., Jackson, K., Stallings, C., McClary, J. F., DuBois, D., & Ford, J. (1993). NADIR: An Automated System for Detecting Network Intrusion and Misuse. Computers and Security.
[Horspool, 1980] Horspool, R. N. (1980). Practical Fast Searching in Strings. Software Practice and Experience, 10(6), 501–506.
[Hotelling, 1933] Hotelling, H. (1933). Analysis of a Complex of Statistical Variables into Principal Components. Journal of Educational Psychology, 24, 417–441.
[Hunt, 1962] Hunt, E. B. (1962). Concept Learning: An Information Processing Problem. Wiley.
[Ilgun, 1992] Ilgun, K. (1992). USTAT: A Real-Time Intrusion Detection System for UNIX. Master’s Thesis.
[Ilgun, 1993] Ilgun, K. (1993). USTAT: A Real-Time Intrusion Detection System for UNIX. In IEEE Symposium on Security and Privacy (pp. 16–28). Oakland, CA.
[Jakobson & Weissman, 1993] Jakobson, G. & Weissman, M. D. (1993). Alarm Correlation. IEEE Network, 7(6), 52–59.
[Javitz & Valdes, 1991] Javitz, H. S. & Valdes, A. (1991). The SRI IDES Statistical Anomaly Detector. In IEEE Symposium on Security and Privacy, Oakland, USA.
[Javitz & Valdes, 1993] Javitz, H. S. & Valdes, A. (1993). The NIDES Statistical Component: Description and Justification. Technical report, Computer Science Laboratory, SRI International.
[Jensen, 1997] Jensen, K. (1997). Coloured Petri Nets: Basic Concepts, Analysis Methods and Practical Use, volume 1. Springer-Verlag, second edition.
[Jolliffe, 2002] Jolliffe, I. T. (2002). Principal Component Analysis. New York, NY: Springer-Verlag, second edition.
[Julisch, 2001] Julisch, K. (2001). Mining Alarm Clusters to Improve Alarm Handling Efficiency. In 17th Annual Computer Security Applications Conference, New Orleans, USA.
[Julisch, 2002] Julisch, K. (2002). Using Root Cause Analysis to Handle Intrusion Detection Alarms. In ACM journal (pp. 111–136).
[KDD Cup, 1999] KDD Cup (1999). KDD Cup 99 Intrusion Detection Datasets. Available at: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
[KDD Task, 1999] KDD Task (1999). KDD 99 Task. Available at: http://kdd.ics.uci.edu/databases/kddcup99/task.html.
[Kendall, 1999] Kendall, K. (1999). A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master’s Thesis.
[Kirby & Sirovich, 1990] Kirby, M. & Sirovich, L. (1990). Application of the Karhunen-Loève Procedure for the Characterization of Human Faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 103–107.
[Kumar, 1995] Kumar, S. (1995). Classification and Detection of Computer Intrusions. PhD Thesis.
[Kumar & Spafford, 1994] Kumar, S. & Spafford, E. (1994). A Pattern Matching Model for Misuse Intrusion Detection. In Proceedings of the 17th National Computer Security Conference (pp. 11–21).
[Lee, 1999] Lee, W. (1999). A Data Mining Framework for Constructing Features and Models for Intrusion Detection Systems. PhD Thesis.
[Lee & Stolfo, 2000] Lee, W. & Stolfo, S. (2000). A Framework for Constructing Features and Models for Intrusion Detection Systems. ACM Transactions on Information and System Security, 3(4).

[Lee et al., 1999] Lee, W., Stolfo, S. J., & Mok, K. (1999). Mining in a Data Flow Environment: Experience in Intrusion Detection. In Proceedings of the 1999 Conference on Knowledge Discovery and Data Mining (KDD-99).
[Levin, 2000] Levin, I. (2000). KDD-99 Classifier Learning Contest: LLSoft’s Results Overview. SIGKDD Explorations, 1, 67–71.
[Li & Mitchell, 2003] Li, N. & Mitchell, J. C. (2003). Datalog with Constraints: A Foundation for Trust Management Languages. In Proceedings of the 5th International Symposium on Practical Aspects of Declarative Languages (PADL 03), New Orleans, LA, USA.
[Lindqvist & Porras, 1999] Lindqvist, U. & Porras, P. A. (1999). Detecting Computer and Network Misuse through the Production-Based Expert System Toolset (P-BEST). In IEEE Symposium on Security and Privacy (pp. 146–161).
[Mé, 1994] Mé, L. (1994). Audit de Sécurité par Algorithmes Génétiques. PhD Thesis.
[Mahoney & Chan, 2003] Mahoney, M. V. & Chan, P. K. (2003). An Analysis of the 1999 DARPA/Lincoln Laboratory Evaluation Data for Network Anomaly Detection. In Sixth International Symposium on the Recent Advances in Intrusion Detection (RAID’2003) (pp. 220–237). Pittsburgh, PA.
[McCallum et al., 1999] McCallum, A., Nigam, K., & Ungar, L. H. (1999). Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. In Proceedings of the 2000 Conference on Knowledge Discovery and Data Mining (KDD-2000) (pp. 169–178).
[McHugh, 2000] McHugh, J. (2000). Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory. ACM Transactions on Information and System Security (TISSEC), 3(4), 262–294.
[Michel & Mé, 2001] Michel, C. & Mé, L. (2001). ADeLe: An Attack Description Language for Knowledge-Based Intrusion Detection. In Trusted Information: The New Decade Challenge, IFIP TC11, Sixteenth Annual Working Conference on Information Security. Paris, France: Kluwer.
[Mitchell, 1997] Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
[Moore et al., 2003] Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., & Weaver, N. (2003). Inside the Slammer Worm. IEEE Security and Privacy, 1(4).
[Morin & Debar, 2003] Morin, B. & Debar, H. (2003). Correlation of Intrusion Symptoms: An Application of Chronicles. In Sixth International Symposium on the Recent Advances in Intrusion Detection (RAID’2003) (pp. 94–112). Pittsburgh, PA.
[Ning et al., 2002] Ning, P., Cui, Y., & Reeves, D. (2002). Constructing Attack Scenarios through Correlation of Intrusion Alerts. In Proceedings of the 9th ACM Conference on Computer and Communications Security (pp. 245–254). Washington, DC, USA.
[One, 1989] One, A. (1989). Smashing The Stack For Fun And Profit.
[Oostendorp et al., 2000] Oostendorp, K., Badger, L., Vance, C. D., Morrison, W. G., Petkac, M. J., Sherman, L., & Sterne, D. F. (2000). Domain and Type Enforcement Firewalls. In DARPA Information Survivability Conference and Exposition.
[Paxson, 1999] Paxson, V. (1999). Bro: A System for Detecting Network Intruders in Real-Time. Computer Networks, 31(23-24).
[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Francisco, CA: Morgan Kaufmann.
[Pfahringer, 2000] Pfahringer, B. (2000). Winning the KDD Classification Cup: Bagged Boosting. SIGKDD Explorations, 1, 65–66.
[Porras & Neumann, 1997] Porras, P. & Neumann, P. (1997). EMERALD: Event Monitoring Enabling Responses to Anomalous Live Disturbances. In Proceedings of the 20th National Information Systems Security Conference (pp. 719–729). Baltimore, MD, USA.

[Qin & Lee, 2003] Qin, X. & Lee, W. (2003). Statistical Causality Analysis of INFOSEC Alert Data. In Sixth International Symposium on the Recent Advances in Intrusion Detection (RAID’2003), Pittsburgh, PA.
[Queiroz et al., 1999] Queiroz, J. D., Carmo, L., & Pirmez, L. (1999). Micael: An Autonomous Mobile Agent System to Protect New Generation Networked Applications. In Second International Workshop on the Recent Advances in Intrusion Detection (RAID’1999), UC Davis, CA.
[Quinlan, 1986] Quinlan, J. R. (1986). Induction of Decision Trees. Machine Learning, 1, 81–106.
[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers.
[Quinlan, 1996] Quinlan, J. R. (1996). Bagging, Boosting, and C4.5. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (pp. 725–730).
[Roesch, 1999] Roesch, M. (1999). Snort - Lightweight Intrusion Detection for Networks. In 13th Systems Administration Conference (LISA 99).
[Rumelhart et al., 1986] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning Representations by Back-Propagating Errors. Nature, 323, 533–536.
[Schnell, 1964] Schnell, P. (1964). A Method for Discovering Data-Groups. Biometrica, 6, 47–48.
[Scholkopf et al., 2001] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the Support of a High-Dimensional Distribution. Neural Computation, 13(7), 1443–1471.
[Sedgewick, 1997] Sedgewick, R. (1997). Algorithms in C: Fundamentals, Data Structures, Sorting, Searching. New York, NY: Addison-Wesley.
[Shyu et al., 2003] Shyu, M. L., Chen, S. C., Sarinnapakorn, K., & Chang, L. W. (2003). A Novel Anomaly Detection Scheme Based on Principal Component Classifier. In Proceedings of the ICDM Foundation and New Direction of Data Mining Workshop (pp. 172–179).
[Smalley & Fraser, 2001] Smalley, S. & Fraser, T. (2001). A Security Policy Configuration for the Security-Enhanced Linux. Technical report, NAI Labs.
[Snapp et al., 1991] Snapp, S. R., Brentano, J., Dias, G. V., Goan, T. L., Heberlein, L. T., Ho, C. L., Levitt, K. N., Mukherjee, B., Smaha, S. E., Grance, T., Teal, D. M., & Mansur, D. (1991). DIDS (Distributed Intrusion Detection System): Motivation, Architecture, and an Early Prototype. In Proceedings of the 14th National Computer Security Conference.
[Snort NIDS, 2005] Snort NIDS (2005). Snort Network Intrusion Detection System. Available at: http://www.snort.org.
[Spafford & Zamboni, 2000] Spafford, E. H. & Zamboni, D. (2000). Intrusion Detection Using Autonomous Agents. Computer Networks, 34, 547–570.
[Squid cache, 2005] Squid cache (2005). Squid Web Proxy Cache. Available at: http://www.squid-cache.org.
[Staniford-Chen et al., 1996] Staniford-Chen, S., Crawford, R., Dilger, M., Frank, J., Hoagland, J., Levitt, K., & Zerkle, D. (1996). GrIDS: A Graph-Based Intrusion Detection System for Large Networks. In Proceedings of the 19th National Information Systems Security Conference.
[Thomsen, 1990] Thomsen, D. J. (1990). Role-Based Application Design and Enforcement. In 4th IFIP Workshop on Database Security, Halifax, England.
[Thomsen, 1995] Thomsen, D. J. (1995). Sidewinder: Combining Type Enforcement and UNIX. In 11th Computer Security Applications Conference, Orlando, FL.
[Tombini et al., 2004] Tombini, E., Debar, H., Mé, L., & Ducassé, M. (2004). A Serial Combination of Anomaly and Misuse IDSes Applied to HTTP Traffic. In Proceedings of the 20th Annual Computer Security Applications Conference (ACSAC 2004), Tucson, Arizona, USA.
[Turk & Pentland, 1991] Turk, M. & Pentland, A. (1991). Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1), 71–86.

[Valdes & Skinner, 1998] Valdes, A. & Skinner, K. (1998). EMERALD TCP Statistical Analyzer 1998 Evaluation Results. Available at: http://www.sdl.sri.com/emerald/98-eval-estat/index.html.
[Valdes & Skinner, 2000] Valdes, A. & Skinner, K. (2000). Adaptive, Model-Based Monitoring for Cyber Attack Detection. In Recent Advances in Intrusion Detection, Third International Workshop, RAID 2000 (pp. 80–92). Toulouse, France.
[Valdes & Skinner, 2001] Valdes, A. & Skinner, K. (2001). Probabilistic Alert Correlation. In Fourth International Workshop on the Recent Advances in Intrusion Detection (RAID’2001) (pp. 54–68). Davis, USA.
[Vigna & Kemmerer, 1999] Vigna, G. & Kemmerer, R. A. (1999). NetSTAT: A Network-Based Intrusion Detection System. Journal of Computer Security, 7(1), 37–71.
[Wang et al., 1997] Wang, W., Yang, J., & Muntz, R. (1997). STING: A Statistical Information Grid Approach to Spatial Data Mining. In International Conference on Very Large Data Bases, Athens, Greece.
[Yang et al., 2000] Yang, J., Ning, P., Wang, X. S., & Jajodia, S. (2000). CARDS: A Distributed System for Detecting Coordinated Attacks. In Proceedings of IFIP TC11 Sixteenth Annual Working Conference on Information Security (pp. 171–180). Washington, DC, USA.
[Zaïane et al., 1998] Zaïane, O. R., Xin, M., & Han, J. (1998). Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. In Advances in Digital Libraries (ADL’1998), Santa Barbara, CA.
[Zhang et al., 1996a] Zhang, T., Ramakrishnan, R., & Livny, M. (1996a). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data.
[Zhang et al., 1996b] Zhang, T., Ramakrishnan, R., & Livny, M. (1996b). BIRCH: An Efficient Data Clustering Method for Very Large Databases. In ACM SIGMOD International Conference on Management of Data.
[Zimmermann et al., 2002] Zimmermann, J., Mé, L., & Bidan, C. (2002). Introducing Reference Flow Control for Detecting Intrusion Symptoms at the OS Level. In Fifth International Workshop on the Recent Advances in Intrusion Detection (RAID’2002) (pp. 292–306). Zurich, Switzerland.
[Zimmermann, 2003] Zimmermann, J. (2003). Détection d’Intrusions Paramétrée par la Politique par Contrôle de Flux de Référence. PhD Thesis.

Appendix A

Description of the attacks in the DARPA 98 intrusion detection data sets

apache2

Denial of service attack against an Apache web server where a client sends a request with many MIME headers. These requests cause the server to slow down and may eventually crash it.

back

Denial of service attack against an Apache web server where a client requests a URL containing many backslashes. As the server tries to process these requests, it slows down and becomes unable to process other requests.
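
A simple misuse signature suffices to spot this attack in HTTP traffic. The following minimal Python sketch (an illustration added here, not part of the DARPA documentation) flags request lines containing an abnormally long run of slashes; the request-line input format and the threshold of 100 are assumptions made for the example.

    # Flag HTTP request lines whose URL contains a long run of slashes,
    # the characteristic signature of the "back" attack.
    def looks_like_back(request_line: str, threshold: int = 100) -> bool:
        run = longest = 0
        for ch in request_line:
            run = run + 1 if ch == "/" else 0
            longest = max(longest, run)
        return longest >= threshold

    print(looks_like_back("GET /index.html HTTP/1.0"))          # False
    print(looks_like_back("GET " + "/" * 500 + " HTTP/1.0"))    # True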

dictionary

Guess passwords for a valid user using simple variants of the account name. Dictionary guessing can be done over many services. Most common are telnet, ftp, pop, or imap.
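
Detection of this attack typically relies on counting failed logins per source and account within a time window. The Python sketch below is illustrative only; the event tuple format (timestamp, source, account, success flag) and the thresholds are assumptions made for the example, not part of the data sets.

    from collections import defaultdict

    # Raise an alert when one source produces too many failed logins
    # against the same account within a sliding time window.
    def dictionary_alerts(events, max_failures=5, window=60.0):
        failures = defaultdict(list)   # (source, account) -> failure timestamps
        alerts = []
        for ts, src, account, success in sorted(events):
            if success:
                continue
            key = (src, account)
            failures[key] = [t for t in failures[key] if ts - t <= window]
            failures[key].append(ts)
            if len(failures[key]) >= max_failures:
                alerts.append((ts, src, account))
        return alerts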

eject

(SunOS 5.5.1, 5.5, 5.4, 5.3) The eject program is used for removable media devices that do not have an eject button or that are managed by Volume Management. Due to insufficient bounds checking on arguments in the volume management library, libvolmgt.so.1, it is possible to overwrite the internal stack space of the eject program. If exploited, this vulnerability can be used to gain root access on the target system.

ffbconfig

(SunOS 5.5,5.5.1) The ffbconfig program configures the Creator Fast Frame Buffer (FFB) Graphics Accelerator, which is part of the FFB Configuration Software Package, SUNWffbcf. This software is used when the FFB Graphics accelerator card is installed. Due to insufficient bounds checking on arguments, it is possible to overwrite the internal stack space of the ffbconfig program. If exploited, this vulnerability can be used to gain root access on the target system.

fdformat

(SunOS 5.5.1,5.5,5.4,5.3) The fdformat program formats diskettes and PCMCIA memory cards. The program also uses the same volume management library, libvolmgt.so.1, and is exposed to the same vulnerability as the eject program.

ftp write

The anonymous FTP root directory (~ftp) and its subdirectories should not be owned by the ftp account or be in the same group as the ftp account. This is a common configuration problem. If any of these directories are owned by ftp or are in the same group as the ftp account and are not write protected, an intruder will be able to add files (such as a .rhosts file) and eventually gain access to the system.

guest

Try to guess password for guest account. Guest accounts are often left open or with simple passwords on badly configured systems.

httptunnel

Multiple session scenario in which an attacker installs a client on the victim machine which wakes up at predefined times to talk to a server controlled by the attacker. All communication is performed in such a way as to make the transactions look like a normal user browsing web pages.

imap

(UW IMAP prior to 10.165, multiple platforms) The imap server must be run with root privileges so it can access mail folders and undertake some file manipulation on behalf of the user logging in. After login, these privileges are discarded. However, a vulnerability exists in the way the login transaction is handled, and this can be exploited to gain privileged access on the server. By sending carefully crafted text to a system running a vulnerable version of the Imap server, remote users can cause a buffer overflow and execute arbitrary instructions with root privileges.

ipsweep

Surveillance sweep on a network to determine what machines are on a network, as well as what services these machines are running.
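
An ipsweep stands out as one source contacting unusually many distinct destinations. A minimal illustrative sketch follows; the (src, dst) pair format and the 20-host threshold are assumptions made for the example.

    from collections import defaultdict

    # Report sources that contacted at least min_hosts distinct destinations.
    def ipsweep_sources(packets, min_hosts=20):
        targets = defaultdict(set)
        for src, dst in packets:
            targets[src].add(dst)
        return {src for src, dsts in targets.items() if len(dsts) >= min_hosts}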

land

Some implementations of TCP/IP are vulnerable to packets that are crafted in a particular way (a spoofed SYN packet in which the source address and port are the same as the destination). Land is a widely available attack tool that exploits this vulnerability.
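
Because the malformed packet is fully characterized by its header fields, a land detector reduces to a stateless predicate. The dict-based packet record below is an assumed format used only for this illustration.

    # A "land" packet: spoofed SYN whose source equals its destination.
    def is_land_packet(pkt) -> bool:
        return (pkt["src_ip"] == pkt["dst_ip"]
                and pkt["src_port"] == pkt["dst_port"]
                and pkt.get("flags") == "SYN")

    print(is_land_packet({"src_ip": "10.0.0.1", "dst_ip": "10.0.0.1",
                          "src_port": 139, "dst_port": 139,
                          "flags": "SYN"}))                      # True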

loadmodule

(SunOS 4.1.x) The loadmodule program is used by the xnews window system server to load two dynamically loadable kernel drivers into the currently running system and to create special devices in the /dev directory to use those modules. Because of the way the loadmodule program sanitizes its environment, unauthorized users can gain root access on the local machine. A script is publicly available and has been used to exploit this vulnerability.

mailbomb

Simple attack where an attacker floods a user’s mailbox with messages.

mscan

An IP and port scanner which looks for a variety of security weaknesses.

multihop

Multi-day scenario in which a user first breaks into one machine, and then uses the compromised machine as a stepping stone for different attacks on other machines. It uses several different exploit methods to gain access.

named

BIND 4.9 releases prior to BIND 4.9.7 and BIND 8 releases prior to 8.1.2 do not properly bounds check a memory copy when responding to an inverse query request. An improperly or maliciously formatted inverse query on a TCP stream can crash the server or allow an attacker to gain root privileges.

neptune

“SYN flooding”. For each half-open connection made to a machine, the TCP stack of the server adds a record to a data structure describing all pending connections. This data structure is of finite size, and it can be made to overflow by intentionally creating too many partially open connections. The table of half-open connections on the victim server will eventually fill; the system will then be unable to accept any new incoming connections until the table is emptied. Normally there is a timeout associated with a pending connection, so half-open connections eventually expire and the victim server recovers. However, the attacking system can simply continue sending IP-spoofed packets requesting new connections faster than the victim can expire the pending ones. In some cases, the system may exhaust memory, crash, or be rendered otherwise inoperative.
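
A detector can approximate the victim's view by replaying the observed TCP control flags and counting the connections that never complete. The sketch below is illustrative; the (connection id, flag) input format and the threshold of 256 pending connections are assumptions made for the example.

    from collections import Counter

    # segments: (conn_id, flag) pairs in arrival order, where conn_id is the
    # 4-tuple (src_ip, src_port, dst_ip, dst_port) and flag is "SYN" for a new
    # request, or "ACK"/"RST"/"FIN" when the handshake completes or aborts.
    def half_open_per_host(segments, threshold=256):
        pending = set()
        for conn_id, flag in segments:
            if flag == "SYN":
                pending.add(conn_id)
            elif flag in ("ACK", "RST", "FIN"):
                pending.discard(conn_id)
        # Count the remaining half-open connections per destination host.
        counts = Counter(dst for (_, _, dst, _) in pending)
        return {dst: n for dst, n in counts.items() if n >= threshold}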

nmap

Network mapping using the nmap tool. It offers a mode of exploring the network with various options including SYN, FIN and ACK scanning with both TCP and UDP, as well as ICMP (Ping) Scanning.

perl

On systems that support saved set-user-ID and set-group-ID, suidperl does not properly relinquish its root privileges when changing its effective user and group IDs. On a system that has the suidperl or sperl program installed and that supports saved set-user-ID and saved set-group-ID, anyone with access to an account on the system can gain root access.

phf

Any CGI program which relies on the CGI function escape_shell_cmd() to prevent exploitation of shell-based library calls may be vulnerable to attack. In particular, this includes the “phf” program which is distributed with the example code. The phf program allows remote users to run arbitrary commands on the server.

pod

“ping of death”. Some systems react in an unpredictable fashion when receiving oversized IP packets. Possible reactions include crashing, freezing, and rebooting.
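
The attack is detectable at the IP layer without reassembling the datagram: no fragment may extend past the 65535-byte limit of an IP packet. The sketch below is illustrative; the fragment record with byte offset and length fields is an assumed format.

    # True when a fragment would make the reassembled datagram exceed
    # the maximum legal IP packet size of 65535 bytes.
    def is_ping_of_death(fragment) -> bool:
        return fragment["offset"] + fragment["length"] > 65535

    print(is_ping_of_death({"offset": 65528, "length": 100}))    # True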

portsweep

Surveillance sweep through many ports to determine which services are supported on a single host. Portsweeps can be made partially stealthy by not finishing the 3-way handshake that opens a port (i.e., FIN scanning).
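
The dual of the ipsweep heuristic applies here: count distinct destination ports per source/destination pair. Illustrative only; the triple format and the 50-port threshold are assumptions made for the example.

    from collections import defaultdict

    # Report (source, destination) pairs probed on at least min_ports ports.
    def portsweep_pairs(packets, min_ports=50):
        ports = defaultdict(set)
        for src, dst, dport in packets:
            ports[(src, dst)].add(dport)
        return {pair for pair, seen in ports.items() if len(seen) >= min_ports}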

processtable

Fills up the process table of a victim machine by slowly opening many sessions and letting them hang. Once the process table is full the victim will not be able to launch any additional processes.

ps

(SunOS 5.x) A race condition in the ps program can be exploited to gain root access if the user has access to the temporary files. Access to temporary files may be obtained if the permissions on the /tmp and /var/tmp directories are set incorrectly. Any users logged in to the system may gain unauthorized root privileges by exploiting this race condition.

rootkit

Rootkit is a scenario in which an attacker breaks into and then installs a rootkit on a target machine. A rootkit is a collection of programs that are intended to help a hacker maintain access to a machine once it has been compromised. A typical rootkit consists of a sniffer, versions of login, su, or other programs with backdoors which allow for access, and new versions of ps, netstat, and ls which hide the fact that a sniffer is running and hide files in certain directories. Once the rootkit has been installed, the attacker may come back several times to download the sniffer logs.

saint

A network probing tool modeled after SATAN. It performs IP and port scans looking for well-known vulnerabilities.

satan

A network probing tool which looks for well-known security vulnerabilities.

sendmail

With the release of sendmail version 8.8.3, a serious security vulnerability was introduced. By sending a carefully crafted email message to a system running this vulnerable version, a remote intruder can force sendmail to execute arbitrary commands with root privileges.

smurf

In the “smurf” attack, attackers use ICMP echo request packets directed to IP broadcast addresses from remote locations to generate denial-of-service attacks. There are three parties in these attacks: the attacker, the intermediary, and the victim (note that the intermediary can also be a victim).
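
From the intermediary's point of view, the trigger is an ICMP echo request addressed to the network's broadcast address. Below is a minimal sketch using the standard ipaddress module; the packet dict format and the monitored network 192.168.0.0/24 are placeholders chosen for this illustration.

    import ipaddress

    # True for an ICMP echo request (type 8) sent to the broadcast address
    # of the monitored network.
    def is_smurf_trigger(pkt, local_net="192.168.0.0/24") -> bool:
        if pkt.get("proto") != "icmp" or pkt.get("icmp_type") != 8:
            return False
        net = ipaddress.ip_network(local_net)
        return ipaddress.ip_address(pkt["dst_ip"]) == net.broadcast_address

    print(is_smurf_trigger({"proto": "icmp", "icmp_type": 8,
                            "dst_ip": "192.168.0.255"}))         # True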

snmpgetattack

A scenario in which an attacker guesses the SNMP community password (using for example the snmpguess attack explained below) and remotely monitors router activity. The SNMP password is set to “public” by default, and is often never changed from this default value.

snmpguess

A dictionary attack against the SNMP community password.


spy

Spy is an information collector that comes back to a compromised machine several times to collect information. A spy might be looking for confidential data files or reading a user’s personal mail. A spy will take steps to minimize the possibility of detection.

sqlattack

Gain access to a shell on a remote system by escaping out of Postgres SQL. Once attackers have access to a shell, they may execute other exploits to further elevate their privileges.

syslog

When Solaris syslogd receives an external message it attempts to do a DNS lookup on the source IP. If this IP does not match a valid DNS record then syslogd will crash with a “Segmentation Fault”.

teardrop

In some implementations of TCP/IP, IP fragmentation re-assembly code does not properly handle overlapping IP fragments. Teardrop is a widely available attack tool that exploits this vulnerability.
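
The overlap is visible from the fragment offsets alone, so a checker needs only the (offset, length) list of one datagram's fragments, an assumed input format for this illustrative sketch.

    # True when any fragment starts inside the byte range of the previous one.
    def has_overlapping_fragments(fragments) -> bool:
        spans = sorted(fragments)
        for (off1, len1), (off2, _) in zip(spans, spans[1:]):
            if off1 + len1 > off2:
                return True
        return False

    print(has_overlapping_fragments([(0, 36), (24, 4)]))   # True: overlapping layout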

udpstorm

When a connection is established between two UDP services, each of which produces output, these two services can produce a very high number of packets that can lead to a denial of service on the machine(s) where the services are offered. Anyone with network connectivity can launch an attack; no account access is needed. For example, by connecting a host’s chargen service to the echo service on the same or another machine, all affected machines may be effectively taken out of service because of the excessive number of packets produced. This attack is also known as the fraggle attack.
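
The triggering packet is easy to characterize: a UDP datagram travelling between two loop-prone services such as echo (port 7) and chargen (port 19). The sketch below is illustrative, with an assumed packet dict format.

    LOOP_PORTS = {7, 19}    # echo and chargen

    # True for a UDP packet whose source and destination are both
    # loop-prone services, the classic udpstorm/fraggle trigger.
    def is_udpstorm_trigger(pkt) -> bool:
        return (pkt.get("proto") == "udp"
                and pkt["src_port"] in LOOP_PORTS
                and pkt["dst_port"] in LOOP_PORTS)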

warez

A multisession scenario in which the warezmaster puts a file on an anonymous FTP site with a world-writeable directory (such as an “incoming” directory) and warezclients then retrieve the file.

worm

An attacker releases a self-replicating program which gains access to machines by using a priori knowledge of valid usernames and passwords.

xlock

An attacker can send a trojan version of xlock to an open X server, hoping to convince the user to type in their password (which will then be returned to the attacker).

xsnoop

Monitor keystrokes on the X server of a user who left their X display open (as would happen after typing xhost +), hoping to see the user type in a username and password.

xterm

Some problems exist in both the xterm program and the Xaw library that allow user-supplied data to cause buffer overflows in both the xterm program and any program that uses the Xaw library. These buffer overflows are associated with the processing of data related to the inputMethod and preeditType resources (for both xterm and Xaw) and the Keymap resources (for xterm). Exploiting these buffer overflows with xterm when it is installed setuid-root, or with any setuid-root program that uses the Xaw library, can allow an unprivileged user to gain root access to the system.

Appendix B

Intrusion and intrusion detection glossary

Alert

A formatted message describing a circumstance relevant to host or network security.

Analyzer/Sensor

An intrusion detection component, which may be implemented in hardware or software, that provides alerts to the manager.

Anomaly analyzer

An analyzer whose detection method is based on deviations of the observed behavior from an a priori established profile.

Attack

An action that may be performed by a user or a program to exploit a vulnerability. An attempt to bypass security controls on a computer is also considered an attack. The attack may alter, release, or deny data. Whether an attack succeeds depends on the vulnerability of the computer system and the effectiveness of existing counter-measures.

Attacker/Intruder

A criminal or malicious user.

Audit trail

A chronological record of system resource usage. This includes user logins, file accesses, and other activities, as well as any actual or attempted security violations, whether legitimate or unauthorized.

Bug

An unwanted and unintended property of a program or piece of hardware, especially one that causes it to malfunction.

Computer security intrusion

Any event of unauthorized access to, or penetration of, an information system.

Counter-measures

Action, device, procedure, technique, or other measure that reduces the vulnerability of an automated information system.

Crash

A sudden, usually drastic failure of a computer system.

Denial of service

An attack whose objective is to render a service unavailable for legitimate users.

Distributed denial of service

A denial of service in which many hosts are recruited to carry out the attack.

False negative

Occurs when an actual intrusive action has taken place but the IDS allows it to pass as a non-intrusive one.

False positive

Occurs when the IDS classifies an action as anomalous (a possible intrusion) when in fact it is legitimate.

Fault tolerance

The ability of a system or a component to continue normal operations despite the presence of hardware or software faults.

Flooding

A denial of service attack that consists in sending a huge amount of legitimate traffic to a victim.

Hijacking

A type of network security attack in which the attacker takes control of a communication between two entities and masquerades as one of them. This attack may be used simply to gain access to the messages, or to enable the attacker to modify them before retransmission.

Intrusion

A finite, non-empty set of actions, which may be carried out maliciously by a user or a program, to violate the security policy of an information system.

Intrusion detection system

A software or hardware component that is used to detect intrusions in an information system.

IP Spoofing

An attack whereby a system attempts to illicitly impersonate another system by using its IP network address.

Keylogger

A computer program that captures the keystrokes of a computer user and stores them. Most malicious keyloggers send this data remotely to a third party.

Manager

An intrusion detection component: a program that provides a console for managing alerts.

Master

A program that is generally installed on a compromised machine. It is used by an attacker to launch commands remotely.

Misuse analyzer

An analyzer whose detection method is based on the knowledge of attack features.


Penetration

The successful unauthorized access to an information system.

Probe

Any effort to gather information about a machine or its users for the apparent purpose of gaining unauthorized access to the system at a later date.

Profile

Patterns of activity of a user, an application, network traffic, etc.

Race condition

A race condition is an undesirable situation that occurs when a device or system attempts to perform two or more operations at the same time although, because of the nature of the device or system, the operations must be done in the proper sequence to be done correctly. In a computer information system, a race condition may occur if commands to read and write data are received at almost the same instant and the machine attempts to overwrite some or all of the old data while that old data is still being read. The result may be one or more of the following: a computer crash, an illegal operation, notification and shutdown of the program, errors reading the old data, or errors writing the new data.
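
The phenomenon is easy to reproduce in a few lines. The Python sketch below (an illustration added to this glossary entry) lets several threads increment a shared counter without a lock; because the increment is a read-modify-write sequence, interleaved updates can be lost and the final value may fall short of the expected 400000.

    import threading

    counter = 0

    def bump(n):
        global counter
        for _ in range(n):
            counter += 1    # read-modify-write: not atomic, updates can be lost

    threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)          # may print less than 400000 without a lock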

Slave/Agent/Handler

A program generally installed on a compromised machine. It executes the commands sent by the master to attack the target machine(s).

Spoofing

Pretending to be someone else. Impersonating, masquerading, and mimicking are forms of spoofing.

Spyware

A program that surreptitiously monitors legitimate users’ actions. Most of the time, it is a remote control program. Software companies also use spyware to gather data about customers.

Trojan horse

An apparently useful and innocent program containing additional hidden code which allows the unauthorized collection, exploitation, falsification, or destruction of data.

Vulnerability

A hardware, firmware, or software flaw that leaves an information system open to potential exploitation. A weakness in automated system security procedures, administrative controls, physical layout, internal controls, and so forth, that could be exploited by a threat to gain unauthorized access to information or disrupt critical processing.

Worm

An independent program that replicates from machine to machine across network connections, often clogging networks and information systems as it spreads.