Pearson’s Correlation Coefficient & Linear Regression with DPC++
Prilvesh Krishna
We implement two statistical algorithms, Pearson’s correlation coefficient and linear regression, with DPC++ and show how to apply them in sales and marketing to forecast future sales based on advertising expenditure.
Project status: Published/In Market
oneAPI, Artificial Intelligence, PC Skills
Overview / Usage
Determine relationships and forecast future sales based on advertising expenditure using Pearson’s correlation coefficient and linear regression with DPC++.
Regression analysis can be used in many areas, such as:
Health:
You can use regression and correlation to see whether eating more sugar is associated with a higher incidence of diabetes.
Food production:
You can see whether using more fertilizer leads to a higher crop yield.
Energy:
You can see whether using DPC++ for parallel processing reduces electricity usage through faster processing times.
The environment:
You can see whether releasing more carbon dioxide into the atmosphere is associated with more ice melt.
Education:
You can see whether spending more time on study is associated with better grades.
Social good:
You can see whether giving discounts to the poor makes them better able to afford basic food items.
It is a simple but efficient technique that lets us view relationships, forecast, and make predictions, so that we can improve operations, make better decisions, and avoid costly problems.
Introduction
Linear regression is a useful tool for mathematically describing the relationship between two things that have been observed and measured.
In statistics these items are called the dependent variable and the explanatory (independent) variable.
Most commonly we use linear regression for three reasons:
- To establish mathematical relationships between observed data.
- To see whether the dependent variable and the independent variable have a positive relationship, a negative relationship, or no relationship at all.
- To predict and forecast future possibilities based on historical data.
In this article we implement the computation-heavy components of the linear regression algorithm in parallel using Intel DPC++, so that we can achieve faster processing times on CPU, GPU, and FPGA architectures.
Scenario
As an example, we implement two statistical algorithms, Pearson’s correlation coefficient and linear regression, with DPC++ and show how to apply them to a real-life sales and marketing problem: forecasting future sales based on advertising expenditure.
Problems solved
Implementing mathematical algorithms with DPC++ helps solve problems such as:
1) Increasing the speed of processing and computation, because the work is done in parallel.
2) Producing small, optimized executables (for example, about 200 KB), which reduces bloat.
3) Processing workloads across multiple architectures such as CPU, GPU, and FPGA devices, in both Windows and Linux environments.
Methodology / Approach
1) First we find a problem that we want to solve. In our case it is to forecast how much sales will be generated if we spend $50.00 on advertising, and to show the working files behind the forecast.
2) We identify the statistical formulas that we want to implement in DPC++. In our case these are the Pearson’s correlation coefficient formula and the linear regression formula.
3) We convert the mathematical formulas into algorithms using BEDMAS (order of operations) and grouping.
4) We identify and define variables from the formulas to hold data.
5) We identify the computation-heavy parts of the algorithms that can be parallelized to perform faster calculations.
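For reference, the formulas named in the steps above are commonly written as shown here, where N is the number of observations, x is the advertising spend, and y is the sales figure. The notation is an assumption chosen to match the variable names used in the code later in the article (sum_x, sum_xy, and so on).

```latex
\begin{align*}
b &= \frac{N\sum xy - \sum x \sum y}{N\sum x^{2} - \left(\sum x\right)^{2}} \\
a &= \frac{\sum y \sum x^{2} - \sum x \sum xy}{N\sum x^{2} - \left(\sum x\right)^{2}} \\
\hat{y} &= a + b\,x \\
r &= \frac{N\sum xy - \sum x \sum y}
          {\sqrt{\left(N\sum x^{2} - \left(\sum x\right)^{2}\right)\left(N\sum y^{2} - \left(\sum y\right)^{2}\right)}}
\end{align*}
```

Every term is built from five running totals (the sums of x, y, xy, x squared, and y squared), which is why the per-element products and their sums are the natural parts to parallelize.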
Programming Stage
You can refer to our GitHub repository to view the full source code; the steps below describe how the program was coded.
We define our headers and namespaces.
#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#include <cmath>
#include <math.h>
#include <iomanip>
#include <limits>
#include <chrono>
using namespace sycl;
We select a device to use for processing; in our case it is the CPU.
queue q(cpu_selector{});
We initialize our x and y datasets by loading the x and y data into two arrays.
int x[] = {43, 21, 25, 42, 57, 59}; // the amounts you spend on advertising each week (data from USP st130 course)
int y[] = {99, 65, 79, 75, 87, 81}; // the sales you get each week (data from USP st130 course)
We initialize the variables that will hold the computed results or be used in further processing, and allocate shared-memory workspaces (an example is shown below).
int sum_y_squared = 0; // sum of y squared values
int *xy = malloc_shared<int>(N, q); // to hold the calculated xy values
We specify the calculations to run in parallel inside each parallel_for call and save the computed data back to the respective arrays (an example is shown below).
q.parallel_for(range<1>(N), [=](id<1> i) {
    x_squared[i] = pow(x[i], 2);
}).wait();
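A sequential reference for these per-element products and their totals can help when checking the parallel results. The sketch below is plain C++ rather than DPC++, and the struct and function names are hypothetical, chosen only to mirror the article's variable names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the five running totals used by the formulas.
struct RegressionSums {
    long long sum_x = 0;
    long long sum_y = 0;
    long long sum_xy = 0;
    long long sum_x_squared = 0;
    long long sum_y_squared = 0;
};

// Sequential reference: accumulate every total in a single pass over the data.
RegressionSums compute_sums(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size());  // every x value needs a matching y value
    RegressionSums s;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const long long xi = x[i];
        const long long yi = y[i];
        s.sum_x += xi;
        s.sum_y += yi;
        s.sum_xy += xi * yi;
        s.sum_x_squared += xi * xi;
        s.sum_y_squared += yi * yi;
    }
    return s;
}
```

Called with the article's two arrays, this yields the same totals that appear in the regression.txt output further down (for example, a sum of X values of 247 and a sum of XY values of 20485).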
Next we define the formula and calculate the intercept coefficient a.
double a = ((sum_y * sum_x_squared) - (sum_x * sum_xy)) / (N * (sum_x_squared) - pow(sum_x, 2));
Next we define the formula and calculate the slope coefficient b.
double b = (N * (sum_xy) - (sum_x * sum_y)) / (N * (sum_x_squared) - pow(sum_x, 2));
Next we define the formula and calculate the sales regression function.
double Sales_regression_function = a + (b * (sample_forecast));
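The three calculations above can also be grouped into small helper functions. This is a minimal sketch in plain C++, assuming the totals have already been computed; `LineFit`, `fit_line`, and `forecast` are hypothetical names, not taken from the project's source.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical result type: intercept a and slope b of the fitted line y = a + b*x.
struct LineFit {
    double a;
    double b;
};

// Compute a and b from the precomputed totals, using the same formulas as above.
LineFit fit_line(double n, double sum_x, double sum_y,
                 double sum_xy, double sum_x_squared) {
    const double denom = n * sum_x_squared - sum_x * sum_x;
    assert(std::fabs(denom) > 0.0);  // denom is zero when all x values are identical
    const double a = (sum_y * sum_x_squared - sum_x * sum_xy) / denom;
    const double b = (n * sum_xy - sum_x * sum_y) / denom;
    return {a, b};
}

// Evaluate the regression function at a given advertising spend.
double forecast(const LineFit& fit, double spend) {
    return fit.a + fit.b * spend;
}
```

Taking the totals as double avoids mixing integer and floating-point arithmetic in the division. With the article's totals (N = 6, sums 247, 486, 20485, and 11409), `forecast(fit, 50)` gives approximately 84.4028, the figure reported in regression.txt.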
Next we define the formula and calculate the Pearson coefficient r.
double pearson_r = (N * (sum_xy) - (sum_x * sum_y)) / sqrt((N * (sum_x_squared) - pow(sum_x, 2)) * (N * (sum_y_squared) - pow(sum_y, 2)));
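The same expression can be wrapped in a function that guards against a zero denominator. This is a sketch in plain C++ under the assumption that the totals are already available; `pearson_from_sums` is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: Pearson's r computed from the precomputed totals.
double pearson_from_sums(double n, double sum_x, double sum_y, double sum_xy,
                         double sum_x_squared, double sum_y_squared) {
    // Each spread term is zero when the corresponding variable never changes,
    // in which case the correlation is undefined.
    const double spread_x = n * sum_x_squared - sum_x * sum_x;
    const double spread_y = n * sum_y_squared - sum_y * sum_y;
    assert(spread_x > 0.0 && spread_y > 0.0);
    return (n * sum_xy - sum_x * sum_y) / std::sqrt(spread_x * spread_y);
}
```

With the article's totals this evaluates to roughly 0.5298, in line with the value written to regression.txt.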
Finally, we output our results to a text file named regression.txt so that we can view the calculated regression forecast and the working behind it.
Results Stage
The program generates the following output inside regression.txt.
The Pearsons correlation is 0.529809
Using the formula y=a+(b*50) We forcast that spending $50 on advertising can result in $84.4028 in sales
Number of dataset values N is 6
Sum of X values 247
Sum of Y values 486
Sum of X squared values 11409
Sum of Y squared values 40022
Sum of XY values 20485
Below is the CSV-format output of the working files:
X Value, Y Value,XY,X Squared,Y Squared,
43,99,4257,1849,9801
21,65,1365,441,4225
25,79,1975,625,6241
42,75,3150,1764,5625
57,87,4959,3249,7569
59,81,4779,3481,6561
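Rows like the ones above could be produced by a small formatting routine. The sketch below is plain C++ and is not taken from the project's source: `working_csv` is a hypothetical name, and the header is written without the trailing comma and extra space seen in the original output.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: build the CSV text of the working values, one row per observation.
std::string working_csv(const std::vector<int>& x, const std::vector<int>& y) {
    assert(x.size() == y.size());  // every x value needs a matching y value
    std::ostringstream out;
    out << "X Value,Y Value,XY,X Squared,Y Squared\n";
    for (std::size_t i = 0; i < x.size(); ++i) {
        out << x[i] << ',' << y[i] << ','
            << x[i] * y[i] << ','   // XY
            << x[i] * x[i] << ','   // X squared
            << y[i] * y[i] << '\n'; // Y squared
    }
    return out.str();
}
```

The returned string can then be written to a file with `std::ofstream` or printed directly.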
Usage instructions:
You can find the code in our GitHub repository, listed at the end of the article.
Copy the entire structure, including all files, to Intel DevCloud.
Ensure that the Python 3.7 (Intel oneAPI) kernel is running.
Ensure that you are using the q file, run_audit.sh, and Makefile provided with this source code.
Ensure that the file lab/regression.cpp exists.
Open the Jupyter notebook regression.ipynb.
Then run the following command:
! chmod 755 q; chmod 755 run_audit.sh; if [ -x "$(command -v qsub)" ]; then ./q run_audit.sh; else ./run_audit.sh; fi
DPC++ Development Experience
Converting a mathematical formula and implementing it in DPC++ was straightforward. We did run into race conditions when summing arrays in parallel, but we learned how to resolve them.
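The article does not say how the race condition was resolved. One common pattern, shown here only as an illustration and not as the project's actual fix, is to give each worker its own partial total and combine the totals afterwards. The sketch is plain sequential C++ (not DPC++); the chunks run one after another here, but because each chunk writes only to its own slot they could run concurrently without two workers updating the same variable. `chunked_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sum `data` by splitting it into `chunks` independent pieces.
// Each piece accumulates into its own slot of `partial`, so no slot is shared;
// the slots are combined in a single step at the end.
long long chunked_sum(const std::vector<int>& data, std::size_t chunks) {
    assert(chunks > 0);
    std::vector<long long> partial(chunks, 0);
    const std::size_t per_chunk = (data.size() + chunks - 1) / chunks;  // ceiling division
    for (std::size_t c = 0; c < chunks; ++c) {  // a parallel version would assign one worker per c
        const std::size_t begin = std::min(c * per_chunk, data.size());
        const std::size_t end = std::min(begin + per_chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[c] += data[i];
        }
    }
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The race arises when many work-items add to one shared variable at the same time; separating the accumulation targets removes that contention at the cost of a final combine step.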
We also learned that, for the data types we process such as int, double, or float, it is easy to convert the code and have it run across different CPU, GPU, and FPGA architectures.
Technologies Used
We use Intel DPC++ and Intel oneAPI to implement the regression analysis and Pearson correlation coefficient formulas.
We use Intel DevCloud to write, compile, and run our application.
Repository
https://github.com/prilcool/Intel-devmesh-codeproject-two