*Magda Gregorova* *24/4/2019*
In class we discussed data normalization as one of the very common transformations you may want to apply to your data. In this notebook you will explore its effects.
Note: In classification problems centering the outputs does not make much sense as these usually encode non-numerical categories such as True/False or Male/Female.
Import the Python libraries used later and define a convenience function plot_wh to quickly plot the height vs. weight data.
import numpy as np
# fix random seed
np.random.seed(123)
import matplotlib.pyplot as plt
# ipython magic for plots in notebook
%matplotlib inline
# height-weight plotting function - quick and dirty!
def plot_wh(plt_x1, plt_x2, plt_y, plt_lab1, plt_lab2, plt_title):
    plt.rcParams['figure.figsize'] = (10.0, 5) # set size of plot
    plt.plot(plt_x1, plt_y, 'o', color='#1f77b4', label=plt_lab1)
    plt.plot(plt_x2, plt_y, 'x', color='#1fb45a', label=plt_lab2)
    plt.xlabel('height')
    plt.ylabel('weight')
    plt.title(plt_title)
    plt.axvline(x=0, color='#d62728', linestyle='dashed')
    plt.axhline(y=0, color='#d62728', linestyle='dashed')
    plt.legend()
The height data are generated in centimeters and then converted to inches (1 inch is 2.54 cm).
The weight data are generated using a linear relationship with the height (obviously, in real life the relationship is much more complex).
sample_size = 100
# height in cm
ht_cm = np.random.normal(175, 7, sample_size)
# height in inches
ht_in = ht_cm/2.54
# weight as linear function of height (with noise)
wt_kg = -95+0.96*ht_cm + np.random.normal(0,1,sample_size)
# plot height vs weight
plot_wh(ht_cm, ht_in, wt_kg, 'height_cm', 'height_in', 'Original data')
Check the plots of the original data and think about the slopes $w_1$ of the regression lines $$\begin{aligned} \text{weight}_{kg} &= w_0^{(cm)} + w_1^{(cm)} \, \text{height}_{cm}\\ \text{weight}_{kg} &= w_0^{(in)} + w_1^{(in)} \, \text{height}_{in} \end{aligned}$$
**Think:** Which of these has the higher slope?
Recall that the slope of a line is the change in the output $y$ per unit change in the input $x$: $$w_1 = \frac{\Delta y}{\Delta x}$$
**Think:** When does the weight change more: when we change the height by 1 unit in centimeters or in inches?
Given this finding, which of the two seems more important for predicting the weight?
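To check your answer numerically, here is a minimal sketch that fits both regression lines with `np.polyfit`. It regenerates data with the same recipe as the cell above, but the variable names and the seed are illustrative assumptions and not part of the notebook.

```python
import numpy as np

# Illustrative data: same recipe as the notebook, but a separate (assumed) seed
rng = np.random.default_rng(0)
height_cm = rng.normal(175, 7, 100)
height_in = height_cm / 2.54
weight_kg = -95 + 0.96 * height_cm + rng.normal(0, 1, 100)

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
slope_cm, intercept_cm = np.polyfit(height_cm, weight_kg, 1)
slope_in, intercept_in = np.polyfit(height_in, weight_kg, 1)

print('Slope per cm   =', slope_cm)
print('Slope per inch =', slope_in)
print('Ratio of slopes =', slope_in / slope_cm)
```

Since one inch spans 2.54 cm, the slope per inch should come out 2.54 times the slope per centimeter, while the intercepts agree. The larger slope reflects the choice of unit, not a more informative variable.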
Get mean and standard deviation of both height variables for reference.
print('Mean of height in cm =',ht_cm.mean())
print('Mean of height in inches =',ht_in.mean())
print('---')
print('Std of height in cm =',ht_cm.std())
print('Std of height in inches =',ht_in.std())
Centering is defined as $x_c = x - \bar{x}$, where $\bar{x}$ is the empirical mean of $x$.
**Code:** Fill in the code below to center the inputs.
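If you do this correctly, the means printed below should be zero (up to floating-point error) and the standard deviations should be unchanged.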
**Pen & paper:** Prove that the mean of centered data is always zero.
Hint
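One possible way in (a sketch, not necessarily the hint the author intended): write out the mean of the centered values and split the sum, using $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$ for a sample of size $N$ (the symbol $N$ is introduced here for illustration).

$$\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}) = \frac{1}{N}\sum_{i=1}^{N} x_i - \frac{1}{N}\, N\, \bar{x} = \bar{x} - \bar{x} = 0$$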
# center inputs - YOUR CODE
ht_cm_c =
ht_in_c =
print('Mean of centered height in cm =',ht_cm_c.mean())
print('Mean of centered height in inches =',ht_in_c.mean())
print('---')
print('Std of centered height in cm =',ht_cm_c.std())
print('Std of centered height in inches =',ht_in_c.std())
# plot height vs weight
plot_wh(ht_cm_c, ht_in_c, wt_kg, 'height_cm_c', 'height_in_c', 'Centered inputs')
**Code:** Fill in the code below to center the outputs.
If you do this correctly, the data in the plots should pass roughly through $y=0$ at $x=0$, so there is no need for the intercept term $w_0$ or the extra $x_0$ variable fixed to 1 in the linear regression!
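The claim about the intercept can be checked on toy data. The sketch below uses made-up variable names and its own seed (both assumptions, independent of the notebook's variables): it fits a line with an intercept to centered data and compares it with a slope-only least-squares fit through the origin.

```python
import numpy as np

# Toy data generated with the same recipe as the notebook (assumed seed)
rng = np.random.default_rng(0)
x = rng.normal(175, 7, 100)
y = -95 + 0.96 * x + rng.normal(0, 1, 100)

# Center both the input and the output
x_c = x - x.mean()
y_c = y - y.mean()

# Fit with an intercept: np.polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x_c, y_c, 1)

# Least-squares fit through the origin (no intercept term at all)
slope_origin = (x_c @ y_c) / (x_c @ x_c)

print('Intercept on centered data =', intercept)
print('Slope with intercept       =', slope)
print('Slope without intercept    =', slope_origin)
```

The fitted intercept should be zero up to floating-point error, and both fits should give the same slope.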
# center outputs - YOUR CODE
wt_kg_c =
# plot height vs weight
plot_wh(ht_cm_c, ht_in_c, wt_kg_c, 'height_cm_c', 'height_in_c', 'Centered inputs and outputs')
To normalize the data we will use the empirical standard deviation $\sigma_x$ as the rescaling factor; this form of normalization is often called standardization.
The normalization is defined as $x_n = \frac{x - \bar{x}}{\sigma_x}$, where $\bar{x}$ and $\sigma_x$ are the empirical mean and standard deviation of $x$.
**Code:** Fill in the code below to standardize the inputs.
We usually normalize only the input data.
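If you do this correctly, the means printed below should be zero and the standard deviations should be one (up to floating-point error).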
**Pen & paper:** Prove that the standard deviation of standardized data is always 1.
Hint
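A possible starting point (a sketch, not necessarily the author's hint, and assuming $\sigma_x$ is computed with the $1/N$ convention that `np.std` uses by default, with $N$ the sample size): the standardized values $x_{n,i}$ have mean zero, so their variance is simply the mean of their squares.

$$\frac{1}{N}\sum_{i=1}^{N} x_{n,i}^2 = \frac{1}{N}\sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{\sigma_x^2} = \frac{\sigma_x^2}{\sigma_x^2} = 1$$

Taking the square root gives a standard deviation of 1.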
# normalize inputs - YOUR CODE
ht_cm_n =
ht_in_n =
print('Mean of normalized height in cm =',ht_cm_n.mean())
print('Mean of normalized height in inches =',ht_in_n.mean())
print('---')
print('Std of normalized height in cm =',ht_cm_n.std())
print('Std of normalized height in inches =',ht_in_n.std())
# plot height vs weight
plot_wh(ht_cm_n, ht_in_n, wt_kg_c, 'height_cm_n', 'height_in_n', 'Normalized inputs, centred outputs')
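As a closing check, the sketch below standardizes heights in both units and compares the results. The helper `standardize`, the variable names, and the seed are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def standardize(x):
    """Subtract the empirical mean and divide by the empirical std (1/N convention)."""
    return (x - x.mean()) / x.std()

# Toy data generated with the same recipe as the notebook (assumed seed)
rng = np.random.default_rng(0)
height_cm = rng.normal(175, 7, 100)
height_in = height_cm / 2.54
weight_kg = -95 + 0.96 * height_cm + rng.normal(0, 1, 100)

cm_n = standardize(height_cm)
in_n = standardize(height_in)
weight_c = weight_kg - weight_kg.mean()

# After standardization the unit of measurement no longer matters
same_inputs = np.allclose(cm_n, in_n)

# Slopes of the regression lines on standardized inputs
slope_cm_n = np.polyfit(cm_n, weight_c, 1)[0]
slope_in_n = np.polyfit(in_n, weight_c, 1)[0]

print('Standardized inputs identical:', same_inputs)
print('Slope (cm, standardized)   =', slope_cm_n)
print('Slope (inch, standardized) =', slope_in_n)
```

Dividing by the standard deviation cancels the conversion factor 2.54, so the two standardized inputs coincide and the two slopes should agree. The slope now measures the change in weight per one standard deviation of height.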