System Failure Prediction using Log Analysis

Animesh Dutta

Kolkata, West Bengal


Achieving an accurate prediction of a system failure, with adequate lead time, is quite difficult. In this project, I present a simple approach that detects an impending failure by parsing log files well in advance, generating an early warning before the failure condition arises.

Project status: Published/In Market

HPC, Artificial Intelligence

Intel Technologies
Intel Python


Overview / Usage

With recent emerging technologies in computing, we have seen tremendous dependence on computer systems to serve a variety of critical services. Sectors such as banking, research organizations, and railway/flight reservation portals all rely on them. It is quite clear that reducing the downtime of these systems is extremely important and can be vital for various operations in day-to-day life. Several incidents have demonstrated the impact of disruption from a computing failure. For instance, in March 2008, a computing failure in the baggage system at Heathrow airport is estimated to have cost $32 million and affected 140,000 people [1].

Console logs, or logs generated in .nmon format on Linux systems, contain descriptive information that we have used here to serve our purpose.

System failure prediction is essential in many applications, such as those where a computer needs to perform heavy computation. Very high hard disk usage or a RAM crash can prevent applications from executing on HPC (High-Performance Computing) systems. High-performance computing is the use of parallel programming to run complex programs efficiently. Recovery of an HPC system can take a long time, or at times might not be possible at all. Research conducted at the University of Toronto shows that by far the most common types of hardware failures in HPC are due to problems with memory or CPU: 20% of hardware failures are attributed to memory and 40% to CPU. Hence this approach can help us fully utilize the potential of HPC and alert a user before a system failure arises. There have been several approaches to system failure prediction, including Bayesian networks, Hidden Markov Models (HMM), Partially Observable Markov Decision Processes (POMDP), and Support Vector Machines (SVMs) [2,3]. Time series forecasting has been common too, but it did not include the parameters we state next, which can be beneficial.

The rest of the paper is structured as follows. Section 2 describes the .nmon log files generated from a system and the process of extracting CPU, RAM, and hard disk utilization from the various parameters.

Log files obtained from systems comprise information on the status and memory consumption of a system. The three main utilizations of a computer are CPU, RAM, and hard disk utilization. These log files provide timestamps and the exact utilization of resources at each timestamp. We have taken into account the values at timestamps spaced at a constant interval.

We had data comprising the log files of a system generated over the past five days in .nmon format. The timestamps are spaced at a fixed interval of 10 minutes.

Methodology / Approach

System Log Files (Parameters reduction)

In computing, log files keep track of all the events occurring in an operating system, and most operating systems include a logging system. System log files enable a dedicated system to generate, filter, and analyze log messages. These files are essential for studying the consumption of various resources while processes execute on a computer system. They give the status of a particular process, such as security problems or system errors. By reviewing the log files, the user can figure out the reason for any issue that arises, or whether all required processes are loading properly. They contain information about the software, hardware, and system components.

We had .nmon files, which were converted to .csv for proper study. The data can also be visualized using the nmon visualizer tool. The parameters we are considering for CPU, RAM, and hard disk utilization are listed below.
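As a rough illustration of the conversion step (the exact .nmon layout varies by nmon version, so the section tag and column order below are assumptions), the CPU_ALL snapshot rows can be pulled into a DataFrame like this:

```python
# Hypothetical sketch: extract CPU_ALL rows from a .nmon file into a DataFrame.
# .nmon files are plain text where each line starts with a section tag; here we
# assume CPU_ALL data rows carry User%, Sys%, Wait% and Idle% in that order.
import pandas as pd

def parse_cpu_all(nmon_text):
    """Collect CPU_ALL data lines (skipping the header row) into a DataFrame."""
    rows = []
    for line in nmon_text.splitlines():
        if line.startswith("CPU_ALL,T"):  # data rows look like CPU_ALL,T0001,...
            _tag, ts, user, sys_, wait, idle = line.split(",")[:6]
            rows.append((ts, float(user), float(sys_), float(wait), float(idle)))
    return pd.DataFrame(rows, columns=["timestamp", "User%", "Sys%", "Wait%", "Idle%"])

sample = "\n".join([
    "AAA,progname,nmon",
    "CPU_ALL,CPU Total,User%,Sys%,Wait%,Idle%",
    "CPU_ALL,T0001,12.5,3.1,0.4,84.0",
    "CPU_ALL,T0002,55.0,10.0,5.0,30.0",
])
df = parse_cpu_all(sample)
```

The resulting table can then be saved with `df.to_csv(...)` for the study described above.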

2.1.1. CPU utilization

Table 1: CPU utilization

User%:

This states that the processor is spending x% of its time running user-space processes. A user-space process is one that runs outside the kernel. Common user-space processes include shells, compilers, databases, and all desktop programs. If the processor isn't idle, the majority of CPU time is usually spent running user-space processes.

Sys%:

This states that the processor is spending x% of its time running kernel (system) processes. The Linux kernel handles all processes and system resources; it performs tasks such as running system processes and managing hardware devices like the hard disk in kernel space.

Wait%:

This states that the processor is spending x% of its time idle while waiting for I/O operations to complete, for example a read from or write to the hard disk that has been issued but has not yet finished.

Idle%:

This shows the percentage of time the CPU is doing nothing at all: no user or kernel processes are running and no I/O is pending.

Hence we can see in Table 1, Idle% doesn’t contribute to CPU utilization. Adding User%, Sys% and Wait% gives us the percentage of CPU utilization at that particular timestamp.

CPU utilization = User% + Sys% + Wait% (1)

We take the logarithmic value (base 10) to keep the value small, i.e. between 0 and 2. For example,

log(1) = 0     (CPU is mainly idle)

log(100) = 2 (CPU is fully utilized)
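Equation (1) together with the log10 compression can be sketched in a few lines (the guard for a zero total is our own addition, since log(0) is undefined):

```python
import math

def cpu_utilization(user, sys, wait):
    """Equation (1): summed busy percentages, compressed with log10 to [0, 2]."""
    total = user + sys + wait
    # log10(0) is undefined, so treat a fully idle CPU as 0 (an assumption)
    return math.log10(total) if total > 0 else 0.0

cpu_utilization(90.0, 8.0, 2.0)  # log10(100) = 2.0 -> CPU fully utilized
cpu_utilization(1.0, 0.0, 0.0)  # log10(1) = 0.0 -> CPU mainly idle
```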

2.1.2. RAM utilization

RAM, or Random Access Memory, is computer data storage used to hold working data and machine code. Even though a computer might have 16 GB or more of RAM, excessive memory usage can cause severe problems and may lead to a system crash.

High RAM usage can lead to many issues. The major causes of high RAM usage are too many startup applications, too many processes running in the background, and too little RAM.

Table 2: Memory nomenclature for RAM

**MemTotal:** Total usable memory

**MemFree:** The amount of physical memory not used by the system

**Buffers:** Memory in the buffer cache, relatively temporary storage for raw disk blocks

**Cached:** Memory in the page cache (disk cache and shared memory)

Table 2 shows common memory notations. We can calculate total memory used using the formula:

MemUsed = MemTotal - MemFree - Buffers - Cached (2)

% of used RAM at a particular timestamp = (MemUsed / MemTotal) * 100 (3)
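Equations (2) and (3) can be sketched as follows, assuming the four values have already been read from the log (the field names follow the /proc/meminfo nomenclature in Table 2; the numbers are illustrative only):

```python
def ram_used_percent(meminfo):
    """Equations (2) and (3): percentage of physical RAM actually in use."""
    used = (meminfo["MemTotal"] - meminfo["MemFree"]
            - meminfo["Buffers"] - meminfo["Cached"])          # equation (2)
    return used / meminfo["MemTotal"] * 100                    # equation (3)

# Illustrative values in kB, as /proc/meminfo reports them:
sample = {"MemTotal": 16_000_000, "MemFree": 4_000_000,
          "Buffers": 1_000_000, "Cached": 3_000_000}
ram_used_percent(sample)  # (16 - 4 - 1 - 3) / 16 * 100 = 50.0
```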

2.1.3. Hard disk utilization

A hard disk is an electromechanical data storage device used to store and retrieve digital information. In Linux, we can get the hard disk utilization of the various drives using the ‘df -h’ command, which gives the percentage use of each drive. Suppose a write operation is taking place and the hard disk is steadily filling up. We can get the free hard disk space at any timestamp; subtracting this from the total hard disk capacity gives the space in use.

Hard disk utilization % = (Hard disk space used / total hard disk space) * 100 (4)
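Equation (4) can be sketched with Python's standard library; `shutil.disk_usage` reports the same used/total figures that ‘df’ does (the path argument here is an illustrative assumption):

```python
import shutil

def disk_used_percent(path):
    """Equation (4): used space over total capacity for the filesystem at `path`."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free (bytes)
    return usage.used / usage.total * 100

pct = disk_used_percent(".")          # utilization of the current filesystem
```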

2.2. PCA

Now we have our values for CPU, RAM and hard disk utilization for timestamps. We apply PCA to get a single reduced value out of these 3 parameters. As we know PCA or Principal Component Analysis is a way to deal with highly correlated variables. We can get a single value for all the utilizations stated above. We can then apply univariate time series forecasting to predict a single value for the future timestamps.

We have applied the covariance matrix method to compute PCA.

The covariance matrix is a matrix whose element in position (j, k) is the covariance between the jth and kth elements of a random vector.

The covariance matrix is solved by calculating the eigenvalues (λ) and eigenvectors (V) as follows:

ΣV = λV (5)

where V and λ in equation 5 represent the eigenvectors and eigenvalues of the covariance matrix, respectively.

The eigenvalues are scalars [4]. The eigenvectors are non-zero vectors representing the principal components (PCs), i.e. each eigenvector represents one PC. The eigenvectors give the directions of the PCA space, while the eigenvalues represent the length, magnitude, or robustness of the eigenvectors. The eigenvector with the largest eigenvalue depicts the first principal component and captures the maximum variance.

Steps for applying PCA:

a) Standardize the data.

b) Calculate the eigenvalues and eigenvectors from the covariance matrix.

c) Sort the eigenvalues in decreasing order to rank corresponding eigenvectors.

d) Select the eigenvector corresponding to the largest eigenvalue and project the data onto it. This gives us the reduced parameter for each timestamp.
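Steps a) through d) can be sketched with NumPy as follows (a minimal illustration on made-up utilization rows, not the project's actual data):

```python
import numpy as np

def first_principal_component(X):
    """Reduce each row of X (n_samples x 3 utilization values) to one PCA score."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # a) standardize the data
    cov = np.cov(Z, rowvar=False)                  # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # b) eigenvalues/eigenvectors
    order = np.argsort(eigvals)[::-1]              # c) sort eigenvalues descending
    pc1 = eigvecs[:, order[0]]                     # d) top eigenvector
    return Z @ pc1                                 # one reduced value per timestamp

# Illustrative rows: (CPU, RAM, hard disk) utilization at four timestamps
X = np.array([[10.0, 20.0, 30.0],
              [12.0, 22.0, 33.0],
              [50.0, 60.0, 70.0],
              [55.0, 66.0, 77.0]])
scores = first_principal_component(X)              # shape (4,)
```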

2.3. Using LSTM to predict the value at future timestamps (univariate time series forecasting):

The figure in the project briefly summarizes the approach used in our paper. We let a moving window run over the log files and feed the windows as inputs to the LSTM. When the forecast shows high RAM, CPU, and hard disk utilization, a warning is triggered, and we can then send an Email or SMS alert to the user on obtaining a failure prediction.
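A minimal sketch of the moving-window preparation, assuming the PCA-reduced series from Section 2.2 as input (window length and sample values are illustrative assumptions); the resulting 3-D array has the (samples, timesteps, features) shape that a Keras LSTM layer expects:

```python
import numpy as np

def make_windows(series, window=6):
    """Slice a 1-D series into (input window, next value) pairs for forecasting."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])     # past `window` readings
        y.append(series[i + window])       # the value to predict
    # Keras LSTM layers expect input shaped (samples, timesteps, features)
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.linspace(0.0, 2.0, 20)   # e.g. 20 PCA-reduced readings, 10 min apart
X, y = make_windows(series, window=6)
# X has shape (14, 6, 1); y has shape (14,)
```

Each `X[i]` can then be fed to an LSTM model whose prediction for `y[i]` exceeding a chosen threshold triggers the Email/SMS alert.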

Technologies Used

Technologies : Machine Learning, Deep Learning, LSTM

Software : Python (Jupyter Notebook)

Libraries : numpy, pandas, sklearn, matplotlib, seaborn, tensorflow, keras

Hardware : 1 TB hard disk, 8 GB RAM, Intel Core i5 7th Gen processor, NVIDIA GeForce graphics card (2 GB)

Repository

https://github.com/animeshdutta888/System-Failure-Prediction-using-log-analysis
