Journal of Data Science and Analysis | by Vikash Ruhil: October 2022

Predictive Analytics using Linear Regression in R

The goal of this project is to perform predictive analysis on a given data set using linear regression. The provided data was used to train the model, which was then used to make predictions on the validation data. I used the R language for the analysis.
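As a rough sketch of the train-then-predict workflow described above, the following uses a small simulated data set rather than the college data. The data frame `toy`, the 70/30 split, the choice of `graduat` as the response, and the simulated relationship are all assumptions made for illustration; they are not taken from the analysis itself.

```r
# Simulated stand-in data (hypothetical, not the college data set)
set.seed(42)
n <- 200
toy <- data.frame(
  tuition  = runif(n, 2000, 12000),
  sf_ratio = runif(n, 8, 30)
)
# Assumed linear relationship plus noise, for illustration only
toy$graduat <- 40 + 0.003 * toy$tuition - 0.5 * toy$sf_ratio + rnorm(n, sd = 5)

# Hold out 30% of the rows as validation data (assumed split)
val_idx    <- sample(seq_len(n), size = 0.3 * n)
train      <- toy[-val_idx, ]
validation <- toy[val_idx, ]

# Fit the linear regression on the training rows only
fit <- lm(graduat ~ tuition + sf_ratio, data = train)

# Predict the response for the validation rows
pred <- predict(fit, newdata = validation)

# Root mean squared error on the validation rows
rmse <- sqrt(mean((validation$graduat - pred)^2))
print(rmse)
```

Keeping the validation rows out of the `lm()` call means the error is measured on data the model has not seen.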

Language used: R
IDE: RStudio

Phase 1: Data Preparation

Load the data from the CSV file, then look at the data set to understand the variables. The training data set has 1155 observations and 11 numeric variables, as shown in the following output.

# Load the data from the CSV file
rawData <- read.csv("college_grad_data.csv", header = TRUE)

# Convert the data into a data.frame()
data <- data.frame(rawData)

# You can check your data by using head() and specifying the number of rows
head(data, 5)

# Get the structure of the data by using str()
str(data)

'data.frame': 1155 obs. of 11 variables:
 $ tuition       : int 3400 5600 4440 6300 2600 4104 11660 2970 8080 2610 ...
 $ pcttop25      : int NA 27 78 57 6 46 88 40 47 NA ...
 $ sf_ratio      : num 14.3 32.8 18.9 16.7 16.5 25.3 14 19.4 11.4 20.1 ...
 $ fac_comp      : int 42300 NA 47700 54600 NA 54800 52300 50300 36600 49300 ...
 $ accrate       : num 0.682 0.928 0.66 0.9 1 0.735 0.73 0.646 0.855 0.868 ...
 $ graduat       : int 40 55 51 69 36 50 72 76 44 33 ...
 $ pct_phd       : int 53 52 72 85 64 94 74 62 63 66 ...
 $ fulltime      : num 92.8 70.3 87.8 90.5 47 ...
 $ alumni        : int NA NA 8 18 NA 3 34 5 9 6 ...
 $ num_enrl      : int 984 179 570 3070 644 1686 287 625 127 887 ...
 $ public.private: int 0 1 0 0 0 0 1 0 1 0 ...

Next, check for missing values; NA values can be found in several ways. The str() output above already shows that several attributes contain NA values. I located all NA values and replaced each one with the mean of the non-missing values of the same variable.
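One common way to check for NA values is to count them per column. The sketch below does this on a small data frame built from the first four values of three columns in the str() output; the object names `demo`, `na_counts`, and `n_complete` are hypothetical.

```r
# Small illustrative data frame (first four values of three columns from str())
demo <- data.frame(
  tuition  = c(3400, 5600, 4440, 6300),
  pcttop25 = c(NA, 27, 78, 57),
  alumni   = c(NA, NA, 8, 18)
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(demo))
print(na_counts)

# Number of rows that contain no missing value at all
n_complete <- sum(complete.cases(demo))
print(n_complete)
```

On the full data set, the same call would be `colSums(is.na(data))`.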

> summary(data)

> install.packages("psych")
also installing the dependency ‘mnormt’

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.2/mnormt_2.1.1.tgz'
Content type 'application/x-gzip' length 211871 bytes (206 KB)
==================================================
downloaded 206 KB

trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.2/psych_2.2.9.tgz'
Content type 'application/x-gzip' length 3826607 bytes (3.6 MB)
==================================================
downloaded 3.6 MB

The downloaded binary packages are in
	/var/folders/2q/c872fz3s0s7clcp1cf8lf_jw0000gn/T//RtmpKbnz2S/downloaded_packages

> library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:Hmisc’:

    describe

The following objects are masked from ‘package:ggplot2’:

    %+%, alpha

# Get a data summary using psych
> psych::describe(data)

               vars    n     mean
tuition           1 1155  9360.15
pcttop25          2  983    52.66
sf_ratio          3 1154    14.78
fac_comp          4 1007 52834.46
accrate           5 1146     0.76
graduat           6 1072    60.89
pct_phd           7 1126    68.68
fulltime          8 1129    79.11
alumni            9  972    21.10
num_enrl         10 1153   781.82
public.private   11 1155     0.65

# Replace NA with mean for pcttop25
> data$pcttop25[is.na(data$pcttop25)]<-mean(data$pcttop25,na.rm=TRUE)

# Replace NA with mean for fac_comp
> data$fac_comp[is.na(data$fac_comp)]<-mean(data$fac_comp,na.rm=TRUE)
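The two replacements above follow the same pattern, so it can be wrapped in a helper and applied to every column at once. The following is a minimal sketch, assuming all columns are numeric; `impute_mean` and `demo` are hypothetical names, and the values are the first four entries of the two columns from the str() output.

```r
# Small illustrative data frame with gaps in both columns
demo <- data.frame(
  pcttop25 = c(NA, 27, 78, 57),
  fac_comp = c(42300, NA, 47700, 54600)
)

# Hypothetical helper: replace NA with the mean of the non-missing values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to every column; demo[] keeps the data.frame structure
demo[] <- lapply(demo, impute_mean)
print(demo)
```

Columns without missing values pass through unchanged, so the helper is safe to apply to the whole data frame.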

Outliers were detected using the 1.5 * IQR (interquartile range) rule. The cutoff probabilities can be set to any values; for the current process they are (.20, .80). Values higher or lower than the resulting range are converted to null (NA). Once they are null, you can either remove them or substitute them.
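The outlier step can be sketched as follows. The exact code is not shown in this write-up, so this is one plausible reading: the lower and upper anchors come from `quantile()` with probabilities (.20, .80), the spread comes from `IQR()`, and anything more than 1.5 * IQR beyond the anchors becomes NA. The function name `flag_outliers`, its arguments, and the sample vector are hypothetical.

```r
# Hypothetical helper: set values outside the fences to NA
flag_outliers <- function(x, probs = c(0.20, 0.80), k = 1.5) {
  q      <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  spread <- IQR(x, na.rm = TRUE)   # assumed: standard 25%-75% IQR
  lower  <- q[1] - k * spread
  upper  <- q[2] + k * spread
  x[!is.na(x) & (x < lower | x > upper)] <- NA
  x
}

# Illustrative vector with one obviously extreme value
x <- c(10, 12, 11, 13, 12, 14, 11, 13, 12, 100)
cleaned <- flag_outliers(x)
print(cleaned)

# Option 1: remove the flagged values
removed <- cleaned[!is.na(cleaned)]

# Option 2: substitute the flagged values with the mean of the rest
substituted <- cleaned
substituted[is.na(substituted)] <- mean(cleaned, na.rm = TRUE)
```

Marking outliers as NA first keeps the detection step separate from the decision to remove or substitute them.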

Histograms:-

Thursday, October 20, 2022
