Event Detection Using Bit Locality Sensitive Hashing & DBSCAN clustering on Twitter Data
Segun sodimu
Ogun State
- 0 Collaborators
Locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input such as tweets, documents, images e.t.c items into the same "buckets" with high probability. We used the concepts of LSH to detect Trending tweets on twitter data ...learn more
Project status: Published/In Market
Intel Technologies
Intel Integrated Graphics
Overview / Usage
Bit locality sensitive hashing with DBScan is an unsupervised machine learning algorithm that takes in twitter data and clusters them to detect trending topic.
The goal of using LSH is to group similar tweets to the same buckets, Candidate pairs are those that hash at least once to the same bucket.
By using MinHash with different permutations as the input data, It is easier to perform nearest neighbour search and group tweets that are similar
Methodology / Approach
Data Collection
We start by downloading twitter data and save into csv file.
Preprocessing
We perform different preprocessing technique on the dataset such as :
- Removing Non-alphanumeric character.
- typecast all words into lower case.
- Eliminate punctuations.
- Replace numbers with the word equivalent.
- Remove stop words
- Perform Lemmatization
Feature Representation
TFIDF, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
We used TFIDF to convert our already preprocessed dataset from human-readable text to sparse vectors for each tweet.
**MinHash Hashing **
MinHash is a technique for quickly estimating how similar two sets are. We will be using Minhash here to transform each tweet into a hashed equivalent using 4 ([64, 128, 256, 512]) different permutations for the hash function.
Locality Sensitive Hashing
Locality-sensitive hashing(LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability.
- We passed the already hashed datasets to an LSH algorithm so as to group similar tweets to the same buckets.
- We filtered and remove noise from all the bucket.
Dimensionality Reduction with PCA
Due to the high dimension of the input data, We will need to reduce the dimension so that it will be faster with less memory consumption when performing PCA operations.
DBSCAN Clustering
Density-based spatial clustering of applications with noise (DBSCAN) is a well-known data clustering algorithm that is commonly used in data mining and machine learning.
- We passed the result of the LSH operations to a DBScan clustering algorithm to automatically cluster and group result together.
- We remove noise from the cluster.
- We select the max cluster with content and get the most occurrence word.
Technologies Used
Python3
Density-based spatial clustering of applications with noise (DBSCAN)
MinHash
Locality Sensitive Hashing