Drug Discovery Using the oneAPI AI Analytics Toolkit

Abhishek Nandy

Kolkata, WB

This project is an example of using the Lipophilicity dataset for drug discovery. It uses molecular descriptors to predict the experimental lipophilicity of a molecule and visualizes the results using plotly. The project also allows users to design new molecules by selecting functional groups ...

Project status: Published/In Market

Intel Technologies
oneAPI

Overview / Usage

This project demonstrates how to use the oneAPI AI Analytics Toolkit in a drug discovery workflow. The toolkit bundles libraries and tools for data analytics, machine learning, and deep learning, and it can be used on a variety of hardware architectures, including CPUs, GPUs, and FPGAs.

The project uses the Lipophilicity dataset, which contains experimentally measured lipophilicity values for small molecules, that is, how readily each compound dissolves in lipids (fats and oils) rather than in water. The goal is to build a regression model that predicts the lipophilicity of a molecule from its molecular descriptors.
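To make "molecular descriptors" concrete, here is a minimal sketch of computing a few of them with RDKit from a SMILES string. The descriptor subset and the helper name `compute_descriptors` are assumptions chosen for illustration; the project description does not say which descriptors it uses.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Hypothetical descriptor subset, chosen only for illustration.
DESCRIPTOR_FUNCS = {
    "MolWt": Descriptors.MolWt,
    "MolLogP": Descriptors.MolLogP,
    "TPSA": Descriptors.TPSA,
    "NumHDonors": Descriptors.NumHDonors,
    "NumHAcceptors": Descriptors.NumHAcceptors,
}


def compute_descriptors(smiles):
    """Return a dict of descriptor values, or None if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for invalid SMILES
        return None
    return {name: float(func(mol)) for name, func in DESCRIPTOR_FUNCS.items()}


# Ethanol as a small, well-known example molecule.
ethanol = compute_descriptors("CCO")
print(ethanol)
```

Applying such a function to every SMILES string in the dataset and dropping the rows that return None would yield the feature table the regression model is trained on.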

The project first loads the dataset, calculates the molecular descriptors for each molecule using the RDKit library, and splits the data into training and test sets. It then trains a random forest regression model on the training set and evaluates its performance on the test set using the R-squared metric.
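The split, train, and evaluate steps can be sketched as follows. To keep the example self-contained, it uses a synthetic feature matrix as a stand-in for the real descriptor table; the data, the 80/20 split, and the hyperparameters are assumptions for illustration, not the project's actual settings or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the descriptor table: 300 "molecules" x 5 "descriptors".
# These are random numbers, not real lipophilicity data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.1, size=300)

# Hold out 20% of the rows as a test set (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a random forest regressor on the training rows only.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score predictions on the held-out rows with R-squared.
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"Test R^2: {r2:.3f}")
```

An R-squared of 1.0 would mean perfect prediction on the test set, while 0.0 corresponds to always predicting the mean of the test targets.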

The project also includes a user interface built with the Streamlit library that allows the user to design new molecules by selecting functional groups to add to a molecular scaffold. The user interface visualizes the molecular scaffold and the added functional groups using RDKit and its Draw module.
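One way to attach functional groups to a scaffold with RDKit is to mark attachment points with dummy atoms (`*`) and substitute them one at a time. The sketch below assumes this approach; the scaffold, the functional-group menu, and the helper `attach_groups` are hypothetical and are not taken from the project.

```python
from rdkit import Chem

# Hypothetical functional-group menu (name -> SMILES).
# The first atom of each SMILES is the atom that bonds to the scaffold.
FUNCTIONAL_GROUPS = {
    "hydroxyl": "O",
    "amine": "N",
    "methyl": "C",
    "carboxylic acid": "C(=O)O",
}

# Hypothetical scaffold: benzene with two para attachment points (dummy atoms).
SCAFFOLD_SMILES = "[*]c1ccc([*])cc1"


def attach_groups(scaffold_smiles, group_names):
    """Replace one dummy atom per selected group; return the canonical SMILES."""
    mol = Chem.MolFromSmiles(scaffold_smiles)
    dummy = Chem.MolFromSmarts("[#0]")  # matches dummy atoms (atomic number 0)
    for name in group_names:
        if not mol.HasSubstructMatch(dummy):
            raise ValueError("more functional groups than attachment points")
        group = Chem.MolFromSmiles(FUNCTIONAL_GROUPS[name])
        # ReplaceSubstructs returns one product per match; take the first.
        mol = Chem.ReplaceSubstructs(mol, dummy, group)[0]
        Chem.SanitizeMol(mol)  # recompute valences and aromaticity
    return Chem.MolToSmiles(mol)


product = attach_groups(SCAFFOLD_SMILES, ["hydroxyl", "carboxylic acid"])
print(product)
```

In a Streamlit app, the list of group names could come from a widget such as `st.multiselect`, and the resulting molecule could be rendered with `Draw.MolToImage` and shown with `st.image`. Any attachment point left unfilled remains as `*` in the output SMILES in this sketch.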

Overall, this project demonstrates how the oneAPI AI Analytics Toolkit can be used to perform drug discovery tasks and provides an example of how to build a user interface for molecule design using Streamlit.

Methodology / Approach

The project follows these steps:

  1. Load data: Load the Lipophilicity dataset for drug discovery using the pandas library.
  2. Calculate molecular descriptors: Use the RDKit library to calculate the molecular descriptors of each molecule in the dataset.
  3. Split data: Split the dataset into training and test sets using the train_test_split() function from the scikit-learn library.
  4. Train model: Train a random forest regressor model on the training set using the RandomForestRegressor class from the scikit-learn library.
  5. Evaluate model: Evaluate the performance of the trained model on the test set using the r2_score() function from the scikit-learn library.
  6. Visualize results: Use the plotly.express module to create a scatter plot of the predicted vs. experimental lipophilicity values for the test set. Add a diagonal line to indicate perfect agreement between the predicted and experimental values.
  7. Design new molecules: Use Streamlit to create an interface for designing new molecules by selecting a number of functional groups to be added to a given molecular scaffold.
  8. Generate molecular scaffold: Build the pre-defined molecular scaffold as an RDKit molecule so that functional groups can be attached to it.
  9. Add functional groups to molecule scaffold: Attach the selected functional groups to the scaffold using the RDKit library to create the final molecule.
  10. Visualize final molecule: Use the RDKit library to visualize the final molecule.

Technologies Used

  1. Python: The project is written in Python, which is a popular high-level programming language used for a variety of purposes, including data analysis, machine learning, and web development.
  2. Streamlit: Streamlit is an open-source Python library used for creating interactive web applications and data visualizations.
  3. pandas: pandas is a popular data manipulation library in Python used for data analysis and cleaning.
  4. NumPy: NumPy is a scientific computing library in Python used for numerical operations and computations.
  5. RDKit: RDKit is an open-source software development kit for cheminformatics and computational chemistry.
  6. scikit-learn: scikit-learn is a machine learning library in Python used for building and evaluating predictive models.
  7. plotly: plotly is a Python data visualization library used for creating interactive charts and graphs.
  8. Intel Optimization: Intel-optimized builds of Python libraries, such as those distributed with the oneAPI AI Analytics Toolkit, used to improve performance on Intel processors.
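One common way to pick up Intel optimizations for scikit-learn is the Intel Extension for Scikit-learn (`sklearnex`), which ships with the toolkit. Whether this project enables it in exactly this way is an assumption; the sketch below falls back to stock scikit-learn when the extension is not installed.

```python
# Try to enable Intel Extension for Scikit-learn; fall back to stock scikit-learn.
try:
    from sklearnex import patch_sklearn

    patch_sklearn()  # call before importing the scikit-learn estimators
    accelerated = True
except ImportError:
    accelerated = False

# Imported after patching so the accelerated implementation is used when available.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=0)
print("Intel-accelerated scikit-learn:", accelerated)
```

Because the patch swaps implementations behind the usual scikit-learn API, the rest of the training code does not need to change.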
