Pearson’s Correlation Coefficient & Linear Regression with DPC++


We implement two statistical algorithms, Pearson’s correlation coefficient and linear regression, with DPC++, and show how to apply them in sales and marketing to forecast future sales based on advertising expenditure.

Project status: Published/In Market

oneAPI, Artificial Intelligence, PC Skills

Intel Technologies
DPC++, oneAPI


Overview / Usage

Determine relationships and forecast future sales based on advertising expenditure using Pearson’s correlation coefficient and linear regression with DPC++.

Regression analysis can be used in many fields, such as:

Health:

You can use regression and correlation to see whether eating more sugar is associated with a higher rate of diabetes.

Food production:

You can see whether using more fertilizer is associated with a higher crop yield.

Energy:

You can see whether using DPC++ for parallel processing reduces your electricity usage because of faster processing times.

The environment:

You can see whether releasing more carbon dioxide into the atmosphere is associated with more ice melt.

Education:

You can see whether spending more time on study is associated with better grades.

Social good:

You can see whether giving discounts to the poor results in them being able to afford basic food items.

It is a simple but efficient technology that enriches our lives by letting us view relationships, forecast and make predictions, so that we can enhance operations, make better decisions and avoid costly problems.

Introduction

Linear regression is a useful tool for mathematically explaining the relationship between two things that have been observed and measured.

In the field of statistics these items are defined as the dependent variable and the explanatory (independent) variable.

Most commonly we use linear regression for three reasons:

  1. To establish mathematical relationships between observed data.
  2. To see whether the dependent variable and the independent variable have a positive relationship, a negative relationship or no relationship at all.
  3. To predict and forecast future possibilities based on historical data.

In this article we implement the computation-heavy components of the linear regression algorithm in parallel using Intel DPC++, so that we can achieve faster processing times on CPU, GPU and FPGA architectures.

Scenario

As an example, we implement two statistical algorithms, Pearson’s correlation coefficient and linear regression, with DPC++, and show how to apply them in a real-life setting such as sales and marketing to forecast future sales based on advertising expenditure.

Problems solved

Implementing mathematical algorithms with DPC++ helps solve problems such as:

  1. Increasing the speed of processing and computation, because the work is done in parallel.

  2. Producing small executables (for example, around 200 KB), which reduces bloat.

  3. Processing workloads across multiple architectures, such as CPU, GPU and FPGA devices, in both Windows and Linux environments.

Methodology / Approach

  1. First we find a problem that we want to solve. In our case it is to forecast how much sales will be generated if we spend $50.00 on advertising, and to show the working files behind the forecast.

  2. We identify the set of statistical formulas that we want to implement in DPC++. In our case these are the Pearson’s correlation coefficient formula and the linear regression formula.

  3. We convert the mathematical formulas into algorithms using BEDMAS (order of operations) and grouping.

  4. We identify and define variables from the formulas to hold data.

  5. We identify the computation-heavy parts of the algorithms that can be parallelized to perform faster calculations.
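For reference, the formulas from step 2 can be written out as follows. This is a sketch in standard textbook notation: N is the number of observations and each sum runs over all observations; the layout is an assumption chosen to match the variable names used in the code further down (sum_x, sum_xy, sum_x_squared and so on).

```latex
\begin{aligned}
a &= \frac{\sum y \,\sum x^{2} - \sum x \,\sum xy}{N \sum x^{2} - \left(\sum x\right)^{2}} \\[4pt]
b &= \frac{N \sum xy - \sum x \,\sum y}{N \sum x^{2} - \left(\sum x\right)^{2}} \\[4pt]
\hat{y} &= a + b\,x \\[4pt]
r &= \frac{N \sum xy - \sum x \,\sum y}
          {\sqrt{\left(N \sum x^{2} - \left(\sum x\right)^{2}\right)\left(N \sum y^{2} - \left(\sum y\right)^{2}\right)}}
\end{aligned}
```

Only the five sums need a pass over the data; once they are known, a, b and r are a handful of arithmetic operations, which is why the per-element products and the sums are the parts worth parallelizing.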

Programming Stage

You can view the full source code through our GitHub link; the steps below describe how the program was coded.

We define our headers and namespaces.

#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#include <cmath>
#include <iomanip>
#include <limits>
#include <chrono>
using namespace sycl;

We select a device to use for processing. In our case it is the CPU.

queue q(cpu_selector{});

We initialize our x and y datasets by loading the x and y data into two arrays.

int x[] = {43, 21, 25, 42, 57, 59}; // the amounts you spend on advertising each week (data from USP st130 course)

int y[] = {99, 65, 79, 75, 87, 81}; // the sales you get each week (data from USP st130 course)

We initialize the variables that will hold the computed results or be used in further processing, and allocate shared-memory work spaces (an example is shown below).

int sum_y_squared = 0; // sum of y squared values

int *xy = malloc_shared<int>(N, q); // to hold xy calculated values

We specify the calculations that need to run in parallel inside each parallel_for call and save the computed values back to their respective arrays (an example is shown below).

q.parallel_for(range<1>(N), [=](id<1> i) {

   x_squared[i] = x[i] * x[i]; // integer multiply; avoids a floating-point pow() round trip

}).wait();
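For readers without a DPC++ toolchain, the per-element work done by these kernels can be sketched in standard C++. This is a host-only illustration and not the project's code: `WorkingColumns` and `compute_columns` are hypothetical names, and the plain loop stands in for `parallel_for`. The point it shows is that iteration i reads and writes only element i, which is what makes this step safe to run in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three derived columns of the working table.
struct WorkingColumns {
    std::vector<int> xy;
    std::vector<int> x_squared;
    std::vector<int> y_squared;
};

// Host-side stand-in for the kernels: iteration i touches only element i,
// so the iterations are independent of one another.
WorkingColumns compute_columns(const std::vector<int>& x,
                               const std::vector<int>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    WorkingColumns c{std::vector<int>(n), std::vector<int>(n),
                     std::vector<int>(n)};
    for (std::size_t i = 0; i < n; ++i) {
        c.xy[i] = x[i] * y[i];
        c.x_squared[i] = x[i] * x[i];
        c.y_squared[i] = y[i] * y[i];
    }
    return c;
}
```

Calling `compute_columns` with the six-week dataset above reproduces the XY, X Squared and Y Squared columns of the working table, for example 43 * 99 = 4257 in the first row.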

Next we define the formula and calculate the intercept coefficient a:

double a = ((sum_y * sum_x_squared) - (sum_x * sum_xy)) / (N * sum_x_squared - pow(sum_x, 2));

Next we define the formula and calculate the slope coefficient b:

double b = (N * sum_xy - (sum_x * sum_y)) / (N * sum_x_squared - pow(sum_x, 2));

Next we define the formula and calculate the sales regression function:

double Sales_regression_function = a + (b * sample_forecast);

Next we define the formula and calculate the Pearson coefficient r:

double pearson_r = (N * sum_xy - (sum_x * sum_y)) / sqrt((N * sum_x_squared - pow(sum_x, 2)) * (N * sum_y_squared - pow(sum_y, 2)));
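The four formulas can be checked end to end with a short host-only sketch in standard C++. It is an illustration under stated assumptions and not the project's source: `RegressionFit`, `fit_line` and `forecast` are hypothetical names, the sums are computed with a sequential loop instead of DPC++ kernels, and the inputs are assumed not to be constant (otherwise the denominators are zero).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type; the field names follow the article's a, b and r.
struct RegressionFit {
    double a;  // intercept coefficient
    double b;  // slope coefficient
    double r;  // Pearson's correlation coefficient
};

// Computes a, b and r from the five sums, using the same formulas as above.
// Assumes x and y are not constant, so the denominators are non-zero.
RegressionFit fit_line(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size() && !x.empty());
    const double n = static_cast<double>(x.size());
    double sum_x = 0, sum_y = 0, sum_xy = 0;
    double sum_x_squared = 0, sum_y_squared = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double xi = x[i];
        const double yi = y[i];
        sum_x += xi;
        sum_y += yi;
        sum_xy += xi * yi;
        sum_x_squared += xi * xi;
        sum_y_squared += yi * yi;
    }
    const double sxx = n * sum_x_squared - sum_x * sum_x;
    const double syy = n * sum_y_squared - sum_y * sum_y;
    const double sxy = n * sum_xy - sum_x * sum_y;
    return {(sum_y * sum_x_squared - sum_x * sum_xy) / sxx,
            sxy / sxx,
            sxy / std::sqrt(sxx * syy)};
}

// Evaluates the regression function y = a + b * x_new.
double forecast(const RegressionFit& fit, double x_new) {
    return fit.a + fit.b * x_new;
}
```

With the six-week dataset from this article, `fit_line` gives r of about 0.5298 and `forecast(fit, 50)` gives about 84.40, which agrees with the values reported in regression.txt.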

Finally, we write our results to a text file named regression.txt so that we can view the calculated regression forecasts and the working behind them.
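The working table can be produced along these lines. This is a minimal host-only sketch, not the project's output code: `working_csv` is a hypothetical helper, and its header row is a tidied version of the one the program prints.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Builds the working table as CSV text: one row per observation holding
// x, y, x*y, x^2 and y^2. Returning a string keeps the helper independent
// of where the text ends up (a file, the console, a test).
std::string working_csv(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size());
    std::ostringstream out;
    out << "X Value,Y Value,XY,X Squared,Y Squared\n";
    for (std::size_t i = 0; i < x.size(); ++i) {
        out << x[i] << ',' << y[i] << ',' << x[i] * y[i] << ','
            << x[i] * x[i] << ',' << y[i] * y[i] << '\n';
    }
    return out.str();
}
```

To write the table to disk, stream the returned string into a `std::ofstream` opened on regression.txt (this needs `<fstream>`).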

Results Stage

The program generates the following output inside regression.txt.

The Pearsons correlation is 0.529809

Using the formula y=a+(b*50) We forcast that spending $50 on advertising can result in $84.4028 in sales

Number of dataset values N is 6

Sum of X values 247

Sum of Y values 486

Sum of X squared values 11409

Sum of Y squared values 40022

Sum of XY values 20485

Below is the CSV-format output of the working files:

X Value, Y Value,XY,X Squared,Y Squared,

43,99,4257,1849,9801

21,65,1365,441,4225

25,79,1975,625,6241

42,75,3150,1764,5625

57,87,4959,3249,7569

59,81,4779,3481,6561

Usage instructions:

You can find the code in our GitHub repository, listed in the Repository section below.

Copy the entire structure, including all files, to Intel DevCloud.

Ensure that the Python 3.7 (Intel oneAPI) kernel is running.

Ensure that you are using the q file, run_audit.sh and Makefile that are provided with this source code.

Ensure that the source file exists at lab/regression.cpp.

Open the Jupyter notebook regression.ipynb.

Then run the following command:

! chmod 755 q; chmod 755 run_audit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_audit.sh; else ./run_audit.sh; fi

DPC++ Development Experience

Converting a mathematical formula and implementing it in DPC++ was straightforward. We did run into race conditions when summing arrays in parallel, but we then learned how to resolve them.
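The source does not say how the race was resolved. A race of this kind typically arises when many work-items add into one shared total at the same time. One common remedy, which may or may not be what we see in the repository, is a two-stage sum: each chunk of the data accumulates into its own slot, and the slots are added together afterwards. The sketch below shows that data layout in standard C++; `chunked_sum` is a hypothetical name, and the chunk loop runs sequentially here as a stand-in for parallel workers. SYCL 2020 also provides a built-in reduction interface for the same purpose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-stage sum.
// Stage 1: chunk c accumulates only into partial[c], so no two chunks ever
//          write the same location and there is nothing to race on.
// Stage 2: a single pass adds the per-chunk results.
long long chunked_sum(const std::vector<int>& data, std::size_t num_chunks) {
    assert(num_chunks > 0);
    std::vector<long long> partial(num_chunks, 0);
    const std::size_t chunk_size = (data.size() + num_chunks - 1) / num_chunks;

    // Sequential stand-in for parallel workers: the iterations of this outer
    // loop are independent, which is the property a parallel version relies on.
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t begin = std::min(c * chunk_size, data.size());
        const std::size_t end = std::min(begin + chunk_size, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[c] += data[i];
        }
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The number of chunks is a tuning choice: more chunks expose more parallelism in stage 1 but leave more values to combine in stage 2.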

We also learned that, whatever type of data we process (int, double or float), it is easy to convert the code and have it run across different CPU, GPU and FPGA architectures.

Technologies Used

We use Intel DPC++ and Intel oneAPI to implement the regression analysis and Pearson correlation coefficient formulas.

We use Intel DevCloud to write, compile and run our application.

Repository

https://github.com/prilcool/Intel-devmesh-codeproject-two
