Anomaly Detection in Computer Network using Autoencoders
Navabhaarathi Asokan
Coimbatore, Tamil Nadu
Anomaly detection is a technique used to identify patterns and behaviors that deviate significantly from normal network activity. In computer networks, normal behavior is typically established by observing historical data, which is then used as a benchmark for detecting anomalies.
Project status: Published/In Market
Overview / Usage
The "Anomaly Detection in Computer Networks using Autoencoders" project addresses several problems related to network security and monitoring. Some of the problems that are solved by this project include:
- Unsupervised Anomaly Detection: Traditional methods often rely on labeled datasets for supervised learning, which can be challenging to obtain in the context of network anomalies. This project employs unsupervised learning using autoencoders, enabling the detection of anomalies without requiring labeled examples.
- Detection of Complex Anomalies: Autoencoders have the capability to learn complex patterns in data. This project aims to detect intricate and novel anomalies that might not be easily recognizable using rule-based or signature-based approaches.
- Adaptive Learning: The autoencoder model adapts to changing network behavior over time. As network dynamics evolve, the model can continue to identify anomalies by learning and adapting to new normal patterns.
- Reduced False Positives: By learning the normal behavior of the network and identifying deviations from it, the project helps in reducing false positive alerts. This is particularly important for efficient allocation of security personnel's time and resources.
- Early Threat Detection: Anomalies in network behavior can often be indicators of potential cyber threats or attacks. Detecting anomalies early can lead to timely responses and mitigation, preventing potential data breaches or service disruptions.
- Scalability: The project leverages machine learning techniques that can be applied to large-scale network data. This addresses the challenge of processing and analyzing vast amounts of network traffic in real time.
- Non-linear Patterns: Autoencoders can capture non-linear relationships in data, allowing the project to effectively detect anomalies that might exhibit complex interactions between different network features.
- Human Expertise Augmentation: While automated techniques are employed, human expertise is still valuable for setting appropriate thresholds and interpreting the detected anomalies. The project combines machine learning with human insights for more accurate results.
- Continuous Monitoring: The autoencoder-based system can operate in real time or near real time, providing continuous monitoring of network activity. This is crucial for quickly identifying and responding to emerging threats.
- Detection of Insider Threats: Anomalies might include not only external attacks but also insider threats, where authorized users misuse their access privileges. The project contributes to detecting such anomalies in network behavior.
Methodology / Approach
Model: To achieve robust anomaly detection in computer network data, two techniques are combined: Isolation Forest and autoencoders. This dual approach harnesses the strengths of both methods to improve the precision and effectiveness of anomaly detection in complex network environments.
Autoencoders Architecture:
Input Layer:
Neurons: Number of input features (determined by input_dim).
Activation: None (raw input).
Encoding Layers:
Layer 1: Dense layer with 64 neurons and ReLU activation.
Dropout: 20% dropout for regularization.
Layer 2: Dense layer with 32 neurons and ReLU activation.
Encoding Bottleneck: Dense layer with encoding_dim (10) neurons and ReLU activation.
Decoding Layers:
Layer 1: Dense layer with 32 neurons and ReLU activation.
Dropout: 20% dropout for regularization.
Layer 2: Dense layer with 64 neurons and ReLU activation.
Output Layer: Dense layer with the same number of neurons as input features (specified by input_dim) and sigmoid activation.
Model Compilation:
Optimizer: Adam optimizer.
Loss Function: Mean Squared Error (MSE) for reconstruction loss.
Training:
Input and Target: Scaled input data (X_scaled) used as both input and target.
Epochs: 20.
Batch Size: 32.
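Expressed in code, the architecture, compilation, and training settings above correspond to a minimal Keras sketch like the following; the synthetic X_scaled here is only a stand-in for the project's scaled feature matrix, and variable names are illustrative.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

rng = np.random.default_rng(42)
X_scaled = rng.random((1000, 20))      # stand-in for the scaled network features
input_dim = X_scaled.shape[1]          # number of input features
encoding_dim = 10                      # bottleneck size

autoencoder = keras.Sequential([
    layers.Input(shape=(input_dim,)),               # raw input, no activation
    layers.Dense(64, activation="relu"),            # encoding layer 1
    layers.Dropout(0.2),                            # 20% dropout for regularization
    layers.Dense(32, activation="relu"),            # encoding layer 2
    layers.Dense(encoding_dim, activation="relu"),  # encoding bottleneck
    layers.Dense(32, activation="relu"),            # decoding layer 1
    layers.Dropout(0.2),                            # 20% dropout for regularization
    layers.Dense(64, activation="relu"),            # decoding layer 2
    layers.Dense(input_dim, activation="sigmoid"),  # reconstruct the input
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X_scaled, X_scaled, epochs=20, batch_size=32)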
Anomaly Detection:
After training, the model calculates MSE between original data and its reconstruction.
Anomaly threshold is set at the 99.9th percentile of MSE values.
Identifying Anomalies:
Data points with MSE above the threshold are considered anomalies.
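Continuing the sketch above, the reconstruction error, threshold, and anomaly mask might be computed as follows:

reconstructions = autoencoder.predict(X_scaled)
mse = np.mean(np.square(X_scaled - reconstructions), axis=1)  # per-sample reconstruction error
threshold = np.percentile(mse, 99.9)  # anomaly threshold at the 99.9th percentile
anomalies = mse > threshold           # True where a data point is flagged as anomalous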
Ensemble Method:
Combining a RandomForestClassifier with an Isolation Forest.
Isolation Forest (IsolationForest):
Isolation Forest is used for initial anomaly score estimation.
contamination is set to 0.0045, and random_state is 42.
Anomaly scores are predicted for data points, where -1 indicates anomalies and 1 indicates normal data.
Random Forest Classifier (RandomForestClassifier):
Random Forest Classifier refines the anomaly detection process.
n_estimators is 100, and random_state is 42.
It is trained on features and anomaly labels derived from the Isolation Forest.
Anomaly predictions are made, and anomalies are identified where the prediction is 0 (anomaly).
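A minimal scikit-learn sketch of this two-stage ensemble, reusing X_scaled from the autoencoder example; mapping the Isolation Forest output to 0 (anomaly) and 1 (normal) is an assumption consistent with the description above.

from sklearn.ensemble import IsolationForest, RandomForestClassifier

iso = IsolationForest(contamination=0.0045, random_state=42)
iso_labels = iso.fit_predict(X_scaled)   # -1 indicates anomalies, 1 indicates normal data

y = np.where(iso_labels == -1, 0, 1)     # relabel: 0 = anomaly, 1 = normal

rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_scaled, y)                     # refine detection using the Isolation Forest labels
ensemble_anomalies = rfc.predict(X_scaled) == 0  # anomalies where the prediction is 0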
Technologies Used
Intel Extension for TensorFlow*:
- Plugs into TensorFlow 2.10 or later to accelerate training and inference on Intel GPU hardware with no code changes.
- Accelerates AI performance with Intel oneAPI Deep Neural Network Library (oneDNN) features such as graph optimizations and memory pool allocation.
- Automatically uses Intel Deep Learning Boost instruction set features to parallelize and accelerate AI workloads.
- Enables optimizations when the environment variable TF_ENABLE_ONEDNN_OPTS=1 is set.
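The flag can be exported in the shell or, as in this sketch, set from Python before TensorFlow is imported, since the variable is read at import time:

import os
os.environ["TF_ENABLE_ONEDNN_OPTS"] = "1"  # must be set before TensorFlow is imported

import tensorflow as tf  # oneDNN optimizations are enabled from here on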
Intel Distribution for Python*:
- The distribution is designed to scale efficiently across multiple CPU cores and threads. This scalability is essential for applications that require high-performance computing.
- It provides essential Python bindings that ease integration of Intel native tools into Python projects, and it works seamlessly with Intel software and libraries.
- Intel Distribution for Python maintains compatibility with the standard Python distribution (CPython), so most existing Python packages and libraries can be used with it seamlessly.
Intel Extension for scikit-learn*:
- The extension can accelerate scikit-learn algorithms by up to 100x, significantly reducing the time it takes to train and deploy machine learning models.
- It integrates seamlessly with scikit-learn, so you can continue to use the same API and code.
- It supports multiple devices, including CPUs, GPUs, and FPGAs, letting you choose the best device for your specific application and workload.
Add two lines of code to patch all compatible algorithms in your Python script; call patch_sklearn() before importing any scikit-learn estimators so the patch takes effect:
from sklearnex import patch_sklearn
patch_sklearn()
Wireshark: Data packet sniffing tool
Repository
https://github.com/nb0309/Network-Traffic-Analysis-using-Machine-learning