Overview / Usage
Background:
Data was collected by a group of scientists in Japan to examine their new (1990s) classifier for multidimensional curves
Applications of the classifier:
Analysis of several natural signals
Time-series like speech
Biological signals
Comparative advantage of the classifier:
Intuitive explanation of the resulting classification rule
Form of the curve is the most important cue for classification
The classifier tries to find regions that all, or almost all, curves of one class pass through and that no curve of any other class passes through
Dataset:
1. The “Japanese Vowels” dataset is taken from the UCI Repository (https://archive.ics.uci.edu/ml/datasets/)
The reduced version is taken from the ODDS repository (http://odds.cs.stonybrook.edu/about-odds/)
2. Nine male speakers uttered two Japanese vowels /ae/ successively. The data contains 12 attributes, forming a discrete-time series of 12 LPC cepstrum coefficients. There are 640 time series in the original dataset.
Cepstrum: the inverse DFT of the log magnitude of the DFT of a signal
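Written out, with x[n] the time-domain signal (a standard reconstruction of the definition above):

c[n] = \mathrm{IDFT}\big( \log \lvert \mathrm{DFT}(x[n]) \rvert \big)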
Owners: Mineichi Kudo, Jun Toyama, Masaru Shimbo
Information Processing Laboratory, Division of Systems and Information Engineering
Significance:
Vowel Recognition by Formants:
The dataset is for recognition of the five Japanese vowels {/a/, /e/, /i/, /o/, /u/}.
“Formants” are calculated for each sample through frequency analysis, as the peak frequencies of the spectrum (see the sketch below).
Each sample is expressed by 4-6 formants.
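As an illustration only (not the dataset owners' procedure), candidate formants can be located as local maxima of a magnitude spectrum; the assumptions about the input array are noted in the comments:

import java.util.ArrayList;
import java.util.List;

public class FormantPeaks {
    // Toy peak picking: candidate formants as local maxima of a magnitude
    // spectrum. Real formant estimation usually peak-picks an LPC spectral
    // envelope rather than the raw DFT magnitude.
    // Assumes magnitude holds bins 0..N/2-1 of an N-point DFT.
    static List<Double> peakFrequencies(double[] magnitude, double sampleRate) {
        List<Double> peaks = new ArrayList<>();
        int n = 2 * magnitude.length; // full DFT length
        for (int i = 1; i < magnitude.length - 1; i++) {
            if (magnitude[i] > magnitude[i - 1] && magnitude[i] > magnitude[i + 1]) {
                peaks.add(i * sampleRate / n); // convert bin index to Hz
            }
        }
        return peaks;
    }
}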
Anomalies:
Skew the model for vowel recognition
Impact:
National Security
Espionage – to determine the true fluency of the speaker
Impact:
Diagnosing & Predicting Critical Diseases
Brain Injury
Stroke
Dementia
Respiratory Impairment
Autism
Cerebral Palsy
Cleft Palate
Key Challenges:
The main challenge is working with reduced data without a time dimension
The original data had a clear time series, but did not have demarcated anomalies
The reduced data tagged anomalies and lost the time dimension
Classical techniques for speech cannot be run on the data
HMM was the most frequently cited technique in the literature
Deep Learning LSTM would have been a strong contender, as it outperforms HMM in most applications
The data is imbalanced
A minority (outlier) class is created by uniformly down-sampling one of the majority classes (see the sketch after this list)
Class 1 (speaker 1) was down-sampled to 50 outliers
Inliers contain classes 6, 7, and 8; the other classes were discarded
Because all anomalies come from a single class, they are not dispersed throughout the other classes
Class 1 was chosen arbitrarily
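A minimal sketch of this down-sampling step; the Record type and field names are illustrative assumptions, not the actual data format:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class BuildOutlierSet {
    // Hypothetical record: speaker label plus the 12 LPC cepstrum coefficients.
    static class Record {
        int speaker;
        double[] cepstra;
    }

    static List<Record> build(List<Record> all) {
        List<Record> inliers = new ArrayList<>();
        List<Record> class1 = new ArrayList<>();
        for (Record r : all) {
            if (r.speaker == 6 || r.speaker == 7 || r.speaker == 8) {
                inliers.add(r);            // inlier classes 6, 7, and 8
            } else if (r.speaker == 1) {
                class1.add(r);             // class to be down-sampled into outliers
            }                              // all other classes are discarded
        }
        Collections.shuffle(class1, new Random(1)); // uniform down-sampling
        List<Record> outliers = new ArrayList<>(class1.subList(0, Math.min(50, class1.size())));
        List<Record> dataset = new ArrayList<>(inliers);
        dataset.addAll(outliers);
        return dataset;
    }
}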
Related Work:
Paper:
Mineichi Kudo, Jun Toyama, and Masaru Shimbo. 1999. Multidimensional curve classification using passing-through regions. Pattern Recognition Letters 20(11): 1103-1111.
Proposed a methodology for classifying sets of data points in a multidimensional space, i.e., multidimensional curves. Worked on a larger dataset of 6000 samples.
Paper:
Aggarwal, Charu and Sathe, Saket. 2016. LODES: Local Density Meets Spectral Outlier Detection. Proceedings of the 2016 SIAM International Conference on Data Mining: 171-179.
Worked with the reduced version. Describes an iterative approach for discovering a high-quality spectral embedding by combining local density-based methods with spectral methods.
Classifiers are:
Paper:
Aggarwal, Charu and Sathe, Saket. 2015. Theoretical Foundations and Algorithms for Outlier Ensembles. ACM SIGKDD Explorations Newsletter 17: 24-47.
Worked with the reduced version.
Classifiers are:
KNN-detector (k=5 and k=10), sketched below
LOF detector
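A minimal sketch of a kNN distance-based detector of the kind cited above (not the authors' implementation): a point's outlier score is its Euclidean distance to its k-th nearest neighbour.

import java.util.Arrays;

public class KnnDetector {
    // Outlier score of each row of X = distance to its k-th nearest neighbour.
    static double[] scores(double[][] X, int k) {
        double[] scores = new double[X.length];
        for (int i = 0; i < X.length; i++) {
            double[] d = new double[X.length];
            for (int j = 0; j < X.length; j++) {
                d[j] = euclidean(X[i], X[j]);
            }
            Arrays.sort(d);    // d[0] == 0 is the point's distance to itself,
            scores[i] = d[k];  // so d[k] is the distance to the k-th nearest neighbour
        }
        return scores;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int t = 0; t < a.length; t++) {
            s += (a[t] - b[t]) * (a[t] - b[t]);
        }
        return Math.sqrt(s);
    }
}

Points with the largest scores (here with k = 5 or k = 10) are flagged as outliers.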
Paper:
Xiaoqing Weng and Junyi Shen. 2008. Classification of multivariate time series using locality preserving projections. Knowledge-Based Systems 21: 581-587.
Proposes a new approach to classifying multivariate time series (MTS) based on Locality Preserving Projections (LPP)
Japanese Vowels dataset contains 640 time series of 12 LPC cepstrum coefficients
Insights:
A Hidden Markov Model attained up to 96.2% accuracy
HMM is domain-specific (speech recognition)
LSTM is a technique that classically outperforms HMM in this domain
Because the data was reduced and the time dimension dropped, LSTM cannot be used
The papers do not explicitly explain the rationale behind their choice of algorithms
They do not use hybrids
They do not use techniques from broader domains
We will use alternative anomaly detection techniques & hybrids
In future work, we will use techniques from other domains
Methodology / Approach
1) Basic
Naive Bayes, SVM, Multi-Layer Perceptron, Simple Logistic, AdaBoost, Decision Table, J48, Random Forest, Random Tree (see the Weka sketch after this list)
2) Ensemble
3) Genetic Algorithm
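A minimal sketch of running one of the basic classifiers through the Weka Java API, assuming the reduced dataset has been exported to ARFF (the file name is hypothetical):

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BasicRun {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF export of the reduced dataset (inliers plus 50 outliers).
        Instances data = new DataSource("vowels_reduced.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1)); // 10-fold cross-validation
        System.out.println(eval.toSummaryString());
    }
}

Any of the other listed basic classifiers can be substituted for J48, e.g. weka.classifiers.trees.RandomForest.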
Technologies Used
Java etc.