M6C Data Science I

*Magda Gregorova* *24/4/2019*

Ex4: data normalization

In class we discussed data normalization as one of the very common transformations you may want to apply to your data. In this notebook you will explore its effects.


Note: In classification problems centering the outputs does not make much sense as these usually encode non-numerical categories such as True/False or Male/Female.

Set-up

Import python libraries to be used later and define a convenience function plot_wh to plot quickly the height vs. the weight data.

import numpy as np
# fix random seed
np.random.seed(123)

import matplotlib.pyplot as plt 
# ipython magic for plots in notebook
%matplotlib inline

# height-weight plotting function - quick and dirty!
def plot_wh(plt_x1, plt_x2, plt_y, plt_lab1, plt_lab2, plt_title):
    plt.rcParams['figure.figsize'] = (10.0, 5) # set size of plot
    plt.plot(plt_x1, plt_y, 'o', color='#1f77b4', label=plt_lab1)
    plt.plot(plt_x2, plt_y, 'x', color='#1fb45a', label=plt_lab2)
    plt.xlabel('height')
    plt.ylabel('weight')
    plt.title(plt_title)
    plt.axvline(x=0, color='#d62728', linestyle='dashed')
    plt.axhline(y=0, color='#d62728', linestyle='dashed')
    plt.legend()

Generate random data for height and weight

The height data are generated in centimeters and then converted to inches (1 inch is 2.54 cm).
The weight data are generated using a linear relationship with the height (obviously, in real life the relationship is much more complex).

sample_size = 100

# height in cm
ht_cm = np.random.normal(175, 7, sample_size)
# height in inches (1 inch = 2.54 cm)
ht_in = ht_cm/2.54

# weight as linear function of height (with noise)
wt_kg = -95+0.96*ht_cm + np.random.normal(0,1,sample_size)

# plot height vs weight
plot_wh(ht_cm, ht_in, wt_kg, 'height_cm', 'height_in', 'Original data')

Explore the plot

Check the plots of the original data and think about the slopes $w_1$ of the regression lines $$\begin{aligned}\text{weight}_{kg} &= w_0^{(cm)} + w_1^{(cm)} \, \text{height}_{cm}\\ \text{weight}_{kg} &= w_0^{(in)} + w_1^{(in)} \, \text{height}_{in}\end{aligned}$$

**Think:** Which of these has the higher slope?

Recall that the slope of a line is defined as a change in the output $y$ per unit of the input $x$ as $$w_1 = \frac{\Delta y}{\Delta x}$$

**Think:** When does the weight change more: when we change the height by 1 unit in centimeters, or by 1 unit in inches?
Given this finding, which of these seems more important for predicting the weight?
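Once you have thought about the questions above, one way to check your answer numerically is to fit both regression lines and compare the slopes. The following is a minimal sketch, not part of the original exercise: the `_demo` variables are hypothetical stand-ins generated the same way as the notebook data, and `np.polyfit` is used here as one convenient least-squares fit.

```python
import numpy as np

# Hypothetical stand-alone data, generated like the notebook data but with a
# local random generator so that the global random seed is left untouched.
rng = np.random.default_rng(0)
ht_cm_demo = rng.normal(175, 7, 100)   # height in cm
ht_in_demo = ht_cm_demo / 2.54         # height in inches
wt_kg_demo = -95 + 0.96 * ht_cm_demo + rng.normal(0, 1, 100)

# np.polyfit with degree 1 returns the coefficients as [slope, intercept]
w1_cm, w0_cm = np.polyfit(ht_cm_demo, wt_kg_demo, 1)
w1_in, w0_in = np.polyfit(ht_in_demo, wt_kg_demo, 1)

print('slope per cm    =', w1_cm)
print('slope per inch  =', w1_in)
print('ratio of slopes =', w1_in / w1_cm)
```

If the ratio of the two slopes matches the unit conversion factor, the difference between them comes from the choice of unit alone and says nothing about which variable is the better predictor.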

Data normalization

Get mean and standard deviation of both height variables for reference.

print('Mean of height in cm =',ht_cm.mean())
print('Mean of height in inches =',ht_in.mean())
print('---')
print('Std of height in cm =',ht_cm.std())
print('Std of height in inches =',ht_in.std())

Centering

Effects of input centering

Centering is defined as $x_c = x - \bar{x}$, where $\bar{x}$ is the empirical mean of $x$.

**Code:** Fill in the code below to center the inputs.

If you do this correctly:

- the new means should be zero (in practice, very small numbers due to numerical precision),
- the standard deviations should remain the same,
- approximately half of the data should lie on each side of the $x=0$ axis,
- the slopes of the lines should not change (careful: the scales on the horizontal and vertical axes of the graph have probably changed, so you cannot compare the slopes with the previous graph by eye).
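The properties listed above can be checked on a small toy array before you apply the same idea to the height data. This is only a sketch: `x_toy` and `y_toy` are made-up values, not the notebook's sample, and `np.polyfit` is used as one convenient way to get a least-squares slope.

```python
import numpy as np

# Hypothetical toy data (not the notebook's height/weight sample)
x_toy = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
y_toy = np.array([50.0, 58.0, 71.0, 77.0, 90.0])

# centering: subtract the empirical mean from every observation
x_toy_c = x_toy - x_toy.mean()

print('mean before / after:', x_toy.mean(), x_toy_c.mean())
print('std before / after: ', x_toy.std(), x_toy_c.std())

# slope of the least-squares line before and after centering the input
slope_raw = np.polyfit(x_toy, y_toy, 1)[0]
slope_centered = np.polyfit(x_toy_c, y_toy, 1)[0]
print('slope before / after:', slope_raw, slope_centered)
```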

**Pen & paper:** Prove that the mean of centered data is always zero.
Hint
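One possible starting point (a sketch, not necessarily the hint originally given here, with $N$ denoting the number of observations): write the mean of the centered data explicitly and split the sum, $$\bar{x}_c = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}) = \frac{1}{N}\sum_{i=1}^{N} x_i - \frac{1}{N}\sum_{i=1}^{N} \bar{x}$$ then simplify each of the two terms separately.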

# center inputs - YOUR CODE
ht_cm_c = 
ht_in_c = 

print('Mean of centered height in cm =',ht_cm_c.mean())
print('Mean of centered height in inches =',ht_in_c.mean())
print('---')
print('Std of centered height in cm =',ht_cm_c.std())
print('Std of centered height in inches =',ht_in_c.std())

# plot height vs weight
plot_wh(ht_cm_c, ht_in_c, wt_kg, 'height_cm_c', 'height_in_c', 'Centered inputs')

Effects of output centering

**Code:** Fill in the code below to center the outputs.

If you do this correctly, the lines in the plots should cross $x=0$ at roughly $y=0$, so there is no need for the intercept term $w_0$ or for the extra variable $x_0$ fixed to 1 in the linear regression!
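The claim about the intercept can be illustrated on toy data. In this sketch (made-up values, hypothetical variable names) a line with an intercept is fitted to centered inputs and outputs, and its slope is compared with a fit forced through the origin, whose least-squares solution is $w_1 = \sum_i x_i y_i / \sum_i x_i^2$.

```python
import numpy as np

# Hypothetical toy data (not the notebook's height/weight sample)
x_toy = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
y_toy = np.array([50.0, 58.0, 71.0, 77.0, 90.0])

# center both the input and the output
x_toy_c = x_toy - x_toy.mean()
y_toy_c = y_toy - y_toy.mean()

# fit with an intercept: np.polyfit returns [slope, intercept]
slope_full, intercept_full = np.polyfit(x_toy_c, y_toy_c, 1)

# fit without an intercept (line through the origin)
slope_origin = (x_toy_c @ y_toy_c) / (x_toy_c @ x_toy_c)

print('intercept of the full fit    =', intercept_full)
print('slope with intercept term    =', slope_full)
print('slope without intercept term =', slope_origin)
```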

# center outputs - YOUR CODE
wt_kg_c = 

# plot height vs weight
plot_wh(ht_cm_c, ht_in_c, wt_kg_c, 'height_cm_c', 'height_in_c', 'Centered inputs and outputs')

Normalization

For the normalization of the data we will use the empirical standard deviation $\sigma_x$ as a rescaling factor (sometimes called standardization).

The normalization is defined as $x_n = \frac{x - \bar{x}}{\sigma_x}$, where $\bar{x}$ and $\sigma_x$ are the empirical mean and standard deviation of $x$.

**Code:** Fill in the code below to standardize the inputs.

We usually normalize only the input data.

If you do this correctly:

- the means should be zero (up to numerical precision),
- the standard deviations should be one (up to numerical precision),
- the data in the plot should overlap: we have shifted them to a common mean (zero) and aligned their scales (the units are now neither centimeters nor inches).
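The effect described above can be seen on a toy array expressed in two units. This is a sketch under stated assumptions: the values are made up, and `standardize` is a hypothetical helper, not a function defined elsewhere in this notebook. It relies on the array method `.std()`, which by default divides by the number of observations (`ddof=0`).

```python
import numpy as np

# Hypothetical toy heights in two units (not the notebook's sample)
x_cm_toy = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
x_in_toy = x_cm_toy / 2.54

def standardize(x):
    """Subtract the empirical mean and divide by the empirical std (ddof=0)."""
    return (x - x.mean()) / x.std()

x_cm_toy_n = standardize(x_cm_toy)
x_in_toy_n = standardize(x_in_toy)

print('means:', x_cm_toy_n.mean(), x_in_toy_n.mean())
print('stds: ', x_cm_toy_n.std(), x_in_toy_n.std())
print('same values in both units:', np.allclose(x_cm_toy_n, x_in_toy_n))
```

After standardization the two arrays no longer carry their original units, which is why the points for both height variables land on top of each other in the plot.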

**Pen & paper:** Prove that the standard deviation of standardized data is always 1.
Hint
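One possible starting point (a sketch, not necessarily the hint originally given here): with $N$ observations and the $\frac{1}{N}$ convention for the variance, use the fact that the standardized data have zero mean and write $$\sigma_{x_n}^2 = \frac{1}{N}\sum_{i=1}^{N} \left(\frac{x_i - \bar{x}}{\sigma_x}\right)^2$$ then pull the constant $\frac{1}{\sigma_x^2}$ out of the sum and compare what remains with the definition of $\sigma_x^2$.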

# normalize inputs - YOUR CODE
ht_cm_n = 
ht_in_n = 

print('Mean of normalized height in cm =',ht_cm_n.mean())
print('Mean of normalized height in inches =',ht_in_n.mean())
print('---')
print('Std of normalized height in cm =',ht_cm_n.std())
print('Std of normalized height in inches =',ht_in_n.std())

# plot height vs weight
plot_wh(ht_cm_n, ht_in_n, wt_kg_c, 'height_cm_n', 'height_in_n', 'Normalized inputs, centred outputs')