Multi-Attribute Selectivity Estimation Using Deep Learning

Jees Augustine

Arlington, Texas

Selectivity estimation, the problem of estimating the result size of queries, is a fundamental yet challenging problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging.

Project status: Under Development

Artificial Intelligence

Intel Technologies
AI DevCloud / Xeon

Overview / Usage

Selectivity estimation, the problem of estimating the result size of queries, is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. We investigate the feasibility of using deep-learning-based approaches for both point and range queries and propose two complementary approaches. Our first approach considers selectivity as an unsupervised deep density estimation problem. We successfully introduce techniques from neural density estimation for this purpose. The key idea is to decompose the joint distribution into a set of tractable conditional probability distributions such that they satisfy the autoregressive property. Our second approach formulates selectivity estimation as a supervised deep learning problem that predicts the selectivity of a given query. We also introduce and address a number of practical challenges arising when adapting deep learning for relational data. These include query/data featurization, incorporating query workload information in a deep learning framework, and the dynamic scenario where both data and workload queries could be updated.
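
The autoregressive decomposition described above can be illustrated on a toy table. The following is a minimal sketch, not the project's implementation: the relation `ROWS` and all helper names are hypothetical, and exact empirical conditionals computed by counting stand in for the conditionals that a neural density estimator would learn.

```python
from itertools import product

# Hypothetical toy relation with two correlated attributes (type, color).
ROWS = [
    ("sedan", "red"), ("sedan", "red"), ("sedan", "blue"),
    ("truck", "black"), ("truck", "black"), ("truck", "red"),
    ("coupe", "blue"), ("coupe", "blue"),
]

def conditional(rows, prefix, value):
    """Empirical P(A_i = value | A_1..A_{i-1} = prefix), computed by counting.

    In the learned setting this quantity would come from the model instead.
    """
    i = len(prefix)
    matching = [r for r in rows if r[:i] == prefix]
    if not matching:
        return 0.0
    return sum(1 for r in matching if r[i] == value) / len(matching)

def point_selectivity(rows, query):
    """Chain rule: P(a_1, ..., a_n) = product over i of P(a_i | a_1..a_{i-1})."""
    sel = 1.0
    for i, value in enumerate(query):
        sel *= conditional(rows, tuple(query[:i]), value)
    return sel

def range_selectivity(rows, allowed):
    """Sum point selectivities over every combination of allowed values.

    `allowed` holds one list of admissible values per attribute.
    """
    return sum(point_selectivity(rows, q) for q in product(*allowed))

if __name__ == "__main__":
    print(point_selectivity(ROWS, ("sedan", "red")))
    print(range_selectivity(ROWS, [["sedan", "truck"], ["red"]]))
```

Enumerating every point in a range grows quickly with the number of attributes and the width of the range, so this brute-force summation is only meant to show how a range query reduces to point queries under the chain rule.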

Methodology / Approach

We take a completely different approach to the selectivity estimation problem here. We train a density estimator so that the model captures the probability density of the entire database. Once the model has absorbed this density, we in turn treat the model as a stand-in for the database and direct queries to the model, which is much smaller than the data and takes much less time to answer. In addition, we claim that the answers returned by the model offer much better bounds than conventional histogram and wavelet methods. We use MADE (Masked Autoencoder for Distribution Estimation), a popular density estimator, to capture the probability density.
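
To make the role of MADE more concrete, the sketch below builds the binary masks that give a single-hidden-layer autoencoder the autoregressive property, and checks that output d has no path from input d or any later input. It is a simplified illustration under stated assumptions, not the project's code: the function names, the layer sizes, and the random degree assignment are all hypothetical, and plain Python lists are used in place of PyTorch tensors.

```python
import random

def made_masks(n_inputs, n_hidden, seed=0):
    """Build MADE-style masks for one hidden layer (assumes n_inputs >= 2).

    Inputs and outputs carry degrees 1..n_inputs; each hidden unit gets a
    random degree in 1..n_inputs-1 (an illustrative choice).
    """
    rng = random.Random(seed)
    in_deg = list(range(1, n_inputs + 1))
    hid_deg = [rng.randint(1, n_inputs - 1) for _ in range(n_hidden)]
    # Hidden unit k may read input d only if its degree is >= d.
    mask_in = [[1 if hk >= d else 0 for d in in_deg] for hk in hid_deg]
    # Output d may read hidden unit k only if d is strictly greater than its degree.
    mask_out = [[1 if d > hk else 0 for hk in hid_deg] for d in in_deg]
    return mask_in, mask_out

def connectivity(mask_in, mask_out):
    """Count input-to-output paths: the matrix product mask_out @ mask_in."""
    n_inputs = len(mask_in[0])
    n_hidden = len(mask_in)
    return [
        [sum(row[k] * mask_in[k][j] for k in range(n_hidden)) for j in range(n_inputs)]
        for row in mask_out
    ]

if __name__ == "__main__":
    m_in, m_out = made_masks(n_inputs=4, n_hidden=16)
    for row in connectivity(m_in, m_out):
        print(row)  # entries on and above the diagonal are zero
```

In a PyTorch model, masks like these would typically be multiplied elementwise with the weight matrices of the corresponding linear layers, so that each output can be read as a conditional distribution over one attribute given the preceding ones.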

Technologies Used

PyTorch, Python, Intel(R) Core(TM) i7-6850K CPU @ 3.60GHz, Matplotlib, Python Data Analysis Library (pandas)
