Multiple Linear Regression Intuition

Multiple linear regression models a linear relationship between multiple independent variables and a single dependent variable.

Formula:

$$y = b_0 + b_1 X_1 + \dots + b_n X_n$$
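To make the formula concrete, here is a tiny numeric sketch of one prediction. The coefficients and the observation are made-up values purely for illustration, not fitted from the dataset below:

```python
import numpy as np

# Hypothetical coefficients, purely for illustration (not fitted from any data)
b0 = 50000.0                                   # intercept b_0
b = np.array([0.8, -0.03, 0.04])               # slopes b_1..b_3
x = np.array([160000.0, 130000.0, 470000.0])   # one observation X_1..X_3

y_hat = b0 + b @ x   # y = b_0 + b_1*X_1 + b_2*X_2 + b_3*X_3
print(y_hat)         # 192900.0
```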
Multiple Linear Regression Implementation

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('50_Startups.csv')
dataset
```
|    | R&D Spend | Administration | Marketing Spend | State      | Profit    |
|----|-----------|----------------|-----------------|------------|-----------|
| 0  | 165349.20 | 136897.80      | 471784.10       | New York   | 192261.83 |
| 1  | 162597.70 | 151377.59      | 443898.53       | California | 191792.06 |
| 2  | 153441.51 | 101145.55      | 407934.54       | Florida    | 191050.39 |
| 3  | 144372.41 | 118671.85      | 383199.62       | New York   | 182901.99 |
| 4  | 142107.34 | 91391.77       | 366168.42       | Florida    | 166187.94 |
| 5  | 131876.90 | 99814.71       | 362861.36       | New York   | 156991.12 |
| 6  | 134615.46 | 147198.87      | 127716.82       | California | 156122.51 |
| 7  | 130298.13 | 145530.06      | 323876.68       | Florida    | 155752.60 |
| 8  | 120542.52 | 148718.95      | 311613.29       | New York   | 152211.77 |
| 9  | 123334.88 | 108679.17      | 304981.62       | California | 149759.96 |
| …  | …         | …              | …               | …          | …         |
```python
X = dataset.iloc[:, :-1].values  # all columns except the last (Profit)
y = dataset.iloc[:, 4].values    # column 4 is Profit, the target
```
array([[165349.2, 136897.8, 471784.1, 'New York'],
[162597.7, 151377.59, 443898.53, 'California'],
[153441.51, 101145.55, 407934.54, 'Florida'],
[144372.41, 118671.85, 383199.62, 'New York'],
[142107.34, 91391.77, 366168.42, 'Florida'],
[131876.9, 99814.71, 362861.36, 'New York'],
[134615.46, 147198.87, 127716.82, 'California'],
[130298.13, 145530.06, 323876.68, 'Florida'],
[120542.52, 148718.95, 311613.29, 'New York'],
[123334.88, 108679.17, 304981.62, 'California'],
...], dtype=object)
array([ 192261.83, 191792.06, 191050.39, 182901.99, 166187.94,
156991.12, 156122.51, 155752.6 , 152211.77, 149759.96,
... ])
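The slicing above takes every column except the last as the feature matrix, and column 4 (Profit) as the target. A self-contained sketch of the same split on a toy frame with the same column layout (the values here are made up):

```python
import pandas as pd

# Toy frame mirroring the 50_Startups column layout (values are made up)
df = pd.DataFrame({
    'R&D Spend': [1.0, 2.0],
    'Administration': [3.0, 4.0],
    'Marketing Spend': [5.0, 6.0],
    'State': ['New York', 'California'],
    'Profit': [10.0, 20.0],
})

X = df.iloc[:, :-1].values   # all columns except the last (Profit)
y = df.iloc[:, 4].values     # column index 4 == 'Profit'
print(X.shape)               # (2, 4)
```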
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The original LabelEncoder + OneHotEncoder(categorical_features=[3]) pattern was
# removed in scikit-learn 0.22; ColumnTransformer is the current way to one-hot
# encode just the 'State' column (index 3) while keeping the numeric columns as-is.
ct = ColumnTransformer(
    transformers=[('state', OneHotEncoder(), [3])],
    remainder='passthrough'
)
X = ct.fit_transform(X)
X
```
array([[ 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
1.65349200e+05, 1.36897800e+05, 4.71784100e+05],
[ 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.62597700e+05, 1.51377590e+05, 4.43898530e+05],
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
1.53441510e+05, 1.01145550e+05, 4.07934540e+05],
[ 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
1.44372410e+05, 1.18671850e+05, 3.83199620e+05],
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
1.42107340e+05, 9.13917700e+04, 3.66168420e+05],
[ 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
1.31876900e+05, 9.98147100e+04, 3.62861360e+05],
[ 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.34615460e+05, 1.47198870e+05, 1.27716820e+05],
[ 0.00000000e+00, 1.00000000e+00, 0.00000000e+00,
1.30298130e+05, 1.45530060e+05, 3.23876680e+05],
[ 0.00000000e+00, 0.00000000e+00, 1.00000000e+00,
1.20542520e+05, 1.48718950e+05, 3.11613290e+05],
[ 1.00000000e+00, 0.00000000e+00, 0.00000000e+00,
1.23334880e+05, 1.08679170e+05, 3.04981620e+05],
...])
During categorical encoding we created three dummy variables to represent 'New York', 'California', and 'Florida'. Because any one of them is fully determined by the other two (the three columns always sum to 1), keeping all of them alongside the model's intercept causes perfect multicollinearity, known as the dummy variable trap. To avoid it, we remove one dummy variable.
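The cell that performs the drop was not shown; judging from the five-column output below, it simply slices off the first dummy column. A sketch on a stand-in matrix:

```python
import numpy as np

# Stand-in for the encoded matrix: first three columns are the State dummies
X = np.array([[0.0, 0.0, 1.0, 165349.2, 136897.8, 471784.1],
              [1.0, 0.0, 0.0, 162597.7, 151377.59, 443898.53]])

X = X[:, 1:]    # drop the first dummy column to escape the dummy variable trap
print(X.shape)  # (2, 5) -- two State dummies plus three numeric features remain
```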
array([[ 0.00000000e+00, 1.00000000e+00, 1.65349200e+05,
1.36897800e+05, 4.71784100e+05],
[ 0.00000000e+00, 0.00000000e+00, 1.62597700e+05,
1.51377590e+05, 4.43898530e+05],
[ 1.00000000e+00, 0.00000000e+00, 1.53441510e+05,
1.01145550e+05, 4.07934540e+05],
[ 0.00000000e+00, 1.00000000e+00, 1.44372410e+05,
1.18671850e+05, 3.83199620e+05],
[ 1.00000000e+00, 0.00000000e+00, 1.42107340e+05,
9.13917700e+04, 3.66168420e+05],
[ 0.00000000e+00, 1.00000000e+00, 1.31876900e+05,
9.98147100e+04, 3.62861360e+05],
[ 0.00000000e+00, 0.00000000e+00, 1.34615460e+05,
1.47198870e+05, 1.27716820e+05],
[ 1.00000000e+00, 0.00000000e+00, 1.30298130e+05,
1.45530060e+05, 3.23876680e+05],
[ 0.00000000e+00, 1.00000000e+00, 1.20542520e+05,
1.48718950e+05, 3.11613290e+05],
[ 0.00000000e+00, 0.00000000e+00, 1.23334880e+05,
1.08679170e+05, 3.04981620e+05],
...])
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)
y_pred
```
array([ 103015.20159795, 132582.27760817, 132447.73845176,
71976.09851257, 178537.48221058, 116161.24230165,
67851.69209675, 98791.73374686, 113969.43533013,
167921.06569553])
For comparison, the actual profits of the test set (`y_test`):

array([ 103282.38, 144259.4 , 146121.95, 77798.83, 191050.39,
105008.31, 81229.06, 97483.56, 110352.25, 166187.94])