Multi-Attribute Selectivity Estimation Using Deep Learning
Jees Augustine
Arlington, Texas
Project status: Under Development
Intel Technologies
AI DevCloud / Xeon
Overview / Usage
Selectivity estimation – the problem of estimating
the result size of queries – is a fundamental problem in
databases. Accurate estimation of query selectivity involving
multiple correlated attributes is especially challenging. Poor
cardinality estimates could result in the selection of bad plans
by the query optimizer. We investigate the feasibility of using
deep learning based approaches for both point and range queries
and propose two complementary approaches. Our first approach
considers selectivity as an unsupervised deep density estimation
problem. We successfully introduce techniques from neural
density estimation for this purpose. The key idea is to decompose
the joint distribution into a set of tractable conditional probability
distributions such that they satisfy the autoregressive property.
Our second approach formulates selectivity estimation as a
supervised deep learning problem that predicts the selectivity of a
given query. We also introduce and address a number of practical
challenges arising when adapting deep learning for relational
data. These include query/data featurization, incorporating query
workload information in a deep learning framework and the
dynamic scenario where both data and workload queries could
be updated.
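The autoregressive decomposition mentioned above can be illustrated with a minimal sketch. It is not the project's implementation: the conditionals here come from exact prefix counts over a tiny made-up table rather than from a neural density estimator, and the names conditional_tables and point_selectivity are hypothetical. It only shows how the selectivity of a point query factors into a product of conditionals P(a_i | a_1, ..., a_{i-1}) under a fixed attribute order.

```python
from collections import Counter


def conditional_tables(rows):
    """Count every attribute-value prefix of every row.

    The ratio count(prefix of length i) / count(prefix of length i-1)
    is the empirical conditional P(a_i | a_1..a_{i-1}).
    """
    prefix_counts = Counter()
    for row in rows:
        for i in range(len(row) + 1):
            prefix_counts[tuple(row[:i])] += 1
    return prefix_counts


def point_selectivity(prefix_counts, query):
    """Chain-rule estimate: multiply the conditionals in attribute order."""
    selectivity = 1.0
    for i in range(1, len(query) + 1):
        numerator = prefix_counts[tuple(query[:i])]
        denominator = prefix_counts[tuple(query[:i - 1])]
        if numerator == 0:
            # No tuple matches this prefix, so the whole query is empty.
            return 0.0
        selectivity *= numerator / denominator
    return selectivity


# Toy two-attribute relation (illustrative data only).
rows = [("NY", "sedan"), ("NY", "sedan"), ("NY", "suv"), ("TX", "truck")]
counts = conditional_tables(rows)
estimate = point_selectivity(counts, ("NY", "sedan"))
print(estimate, estimate * len(rows))  # selectivity and estimated row count
```

In a learned estimator each ratio would be replaced by a conditional probability produced by the model, so the table of counts never has to be stored.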
Methodology / Approach
We take a different approach to the selectivity estimation problem. We train a density estimator so that the model captures the joint probability distribution of the complete database. Once the model has absorbed this probability density, we in turn treat the model as a stand-in for the database and direct queries to the model, which is much smaller than the data and answers in much less time. In addition, we claim that the answers returned by the model offer much better bounds than conventional histogram and wavelet methods. We use MADE (Masked Autoencoder for Distribution Estimation), a popular density estimator, to absorb the probability density.
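As a rough illustration of how MADE enforces the autoregressive property, the sketch below builds the binary connectivity masks for a network with one hidden layer and checks that output d can only depend on inputs that come before d. It uses plain Python lists instead of PyTorch tensors so it runs anywhere. The function names, the layer size, and the random degree assignment are assumptions made for this example, not details of the project's code. In a MADE network the weight matrices are multiplied elementwise by masks of this kind.

```python
import random


def made_masks(num_inputs, num_hidden, seed=0):
    """Build input->hidden and hidden->output masks for a one-hidden-layer MADE.

    Requires num_inputs >= 2. Inputs and outputs get degrees 1..num_inputs;
    each hidden unit gets a random degree in 1..num_inputs-1.
    """
    rng = random.Random(seed)
    in_deg = list(range(1, num_inputs + 1))
    hid_deg = [rng.randint(1, num_inputs - 1) for _ in range(num_hidden)]
    # Hidden unit k may read input d only if its degree is at least d.
    mask_in = [[1 if hk >= d else 0 for d in in_deg] for hk in hid_deg]
    # Output d may read hidden unit k only if d is strictly larger than k's degree.
    mask_out = [[1 if d > hk else 0 for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out


def dependencies(mask_in, mask_out):
    """dep[d][j] is True when some unmasked path links input j to output d."""
    num_outputs = len(mask_out)
    num_hidden = len(mask_in)
    return [
        [any(mask_out[d][k] and mask_in[k][j] for k in range(num_hidden))
         for j in range(num_outputs)]
        for d in range(num_outputs)
    ]


mask_in, mask_out = made_masks(num_inputs=4, num_hidden=16)
dep = dependencies(mask_in, mask_out)
# Autoregressive property: output d never sees input d or any later input.
assert all(not dep[d][j] for d in range(4) for j in range(4) if j >= d)
```

Because the first output has no unmasked inputs, it models the unconditional distribution of the first attribute; every later output models a conditional on the attributes before it.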
Technologies Used
PyTorch, Python, Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, Matplotlib, Pandas (Python Data Analysis Library)