We can prevent an overly flexible model from overfitting by using regularisation.
We modify the loss function to include a penalty for overly large parameter values:
$$J_{pen}(X,y,\vec w) = C\, J(X,y,\vec w) + \frac{1}{2}\vec w \cdot \vec w$$
Small values of $C$ mean strong regularisation; large values of $C$ mean weak regularisation.
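This is the convention used, for example, by scikit-learn's linear SVM classifier, where $C$ multiplies the data term. A minimal sketch (the dataset and the values of $C$ are illustrative choices only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Illustrative classification data; the exact dataset does not matter here.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Small C: the penalty (1/2) w.w dominates -> strong regularisation -> small weights.
# Large C: the data term dominates -> weak regularisation -> larger weights.
for C in (0.01, 1.0, 100.0):
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    print(f"C = {C:6.2f}   ||w|| = {np.linalg.norm(clf.coef_):.3f}")
```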
We now look at a one-dimensional example. Say we have the relationship
$$ y = y(x) = 7 - 8x - \frac12 x^2 + \frac12 x^3 + \epsilon$$
where $\epsilon$ is Gaussian noise with mean 0 and unit variance.
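A minimal sketch of how such a sample could be generated (the sample size, the $x$ range and the random seed are assumptions made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, purely for reproducibility
n = 30                                    # sample size is an illustrative choice
x = rng.uniform(-3.0, 3.0, size=n)        # x range is an illustrative choice
eps = rng.normal(0.0, 1.0, size=n)        # Gaussian noise, mean 0, unit variance
y = 7 - 8*x - 0.5*x**2 + 0.5*x**3 + eps
```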
We can fit the data using polynomials of different order $k$:
The fit minimises the least-squares objective:
$$J(x,y,w) = \sum_i \left(p_w(x^{(i)})-y^{(i)}\right)^2\;,\qquad p_w(x)=\sum\limits_{i=0}^{k} w_i x^i $$
The third-order fit gets close to the right coefficients:
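A sketch of these fits with numpy's `polyfit`, which minimises exactly this least-squares objective; it produces the second-, third- and tenth-order fits discussed here, reusing `x` and `y` from above:

```python
# Least-squares polynomial fits of different order k, using x and y from above.
# np.polyfit returns the coefficients from the highest power down to the constant.
for k in (2, 3, 10):
    w = np.polyfit(x, y, deg=k)
    print(f"k = {k:2d}  coefficients (highest order first):", np.round(w, 2))
```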
A second-order fit is not too bad either.
The 10th-order polynomial is clearly overfitting: its coefficients are very large.
Large coefficient values and large cancellations between coefficients are a sign of overfitting.
We can prevent large values of the coefficients if we modify the objective:
$$J_{pen}(x,y,w,\alpha) = J(x,y,w) + \alpha\,\frac12 \sum\limits_{i=0}^{k} w_i^2 $$
As we have seen for the SVM loss function, there are two terms in the objective that pull the result in opposite, conflicting directions. Small values of $\alpha$ do not change the objective function much and correspond to mild regularisation. Large values of $\alpha$ impose a stronger constraint on the size of the coefficients, meaning more regularisation.
This is called ridge regression.
The regularised coefficients are a lot smaller than the unregularised ones:
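A sketch of such a regularised fit using scikit-learn's `Ridge` on tenth-order polynomial features, reusing the synthetic sample from above (the value of $\alpha$ is only an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Tenth-order polynomial fit with an L2 penalty on the coefficients (ridge regression),
# reusing x and y from above. The value of alpha is an illustrative choice.
model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                      Ridge(alpha=1.0))
model.fit(x.reshape(-1, 1), y)
print("regularised coefficients:", np.round(model.named_steps["ridge"].coef_, 2))
```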
In a normal problem we have no access to the underlying probability distribution, so hyperparameters such as $\alpha$ have to be tuned on a held-out validation set.
It is often the case that we do not have much labelled data, and keeping a significant portion of it aside for hyperparameter optimisation feels like a waste.
We can use a technique called cross-validation to perform validation without losing too much of the training set.
In $k$-fold cross-validation we partition the data sample randomly into $k$ subsamples. For each subsample we train our model on the remaining subsamples and use the chosen subsample to validate the model. This gives a set of $k$ estimates of the model parameters and $k$ validation scores. We can use these to estimate the average validation performance, and its spread, for each candidate choice of the hyperparameters, while still training on almost all of the data.
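A sketch of $k$-fold cross-validation with scikit-learn, used here to compare a few candidate values of the regularisation strength on the polynomial example above (the $\alpha$ grid and the choice $k=5$ are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# 5-fold cross-validation to compare a few candidate regularisation strengths;
# the alpha grid and the choice k = 5 are illustrative.
X = x.reshape(-1, 1)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                          Ridge(alpha=alpha))
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    print(f"alpha = {alpha:5.2f}   mean MSE = {-scores.mean():.2f} "
          f"(+/- {scores.std():.2f})")
```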