We can perform basis expansions in higher dimensions. For example, we can expand the simple linear model $\psi(x) = [1, x_1, x_2]^T$ into the quadratic form $\psi(x) = [1, x_1, x_2, x_1^2, x_2^2, x_1 x_2]^T$. By doing so, we are still fitting a linear model, but with more, higher-order features.
Using degree-$d$ polynomials in $k$ dimensions results in $O(k^d)$ features.
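A minimal NumPy sketch of this quadratic expansion (the function name is illustrative, not from any library):

```python
import numpy as np

def quadratic_features(x):
    """Expand a 2-D input [x1, x2] into [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# The model stays linear in the weights w; only the feature vector grows.
x = np.array([2.0, -1.0])
print(quadratic_features(x))   # [ 1.  2. -1.  4.  1. -2.]
```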
We can also use kernel values as features. For some feature expansions $\phi$, a kernel computes the dot product of the expanded features directly, i.e. $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, $\kappa(x', x) = \phi(x') \cdot \phi(x)$.
Some examples of kernels:
Polynomial kernel: $\kappa_{poly}(x', x) = (x \cdot x' + \theta)^d$.
Radial Basis Function (RBF) kernel: $\kappa_{RBF}(x', x) = \exp(-\gamma ||x - x'||^2)$.
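Both kernels can be evaluated directly from $x$ and $x'$, without ever forming expanded features. A small NumPy sketch (the default values of $\theta$, $d$ and $\gamma$ are arbitrary):

```python
import numpy as np

def poly_kernel(x, x_prime, theta=1.0, d=2):
    """Polynomial kernel (x . x' + theta)^d."""
    return (np.dot(x, x_prime) + theta) ** d

def rbf_kernel(x, x_prime, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(poly_kernel(x, xp))   # (2 + 1 + 1)^2 = 16.0
print(rbf_kernel(x, xp))    # exp(-(1 + 2.25)) ~= 0.0388
```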
Let $C(d; n_1, \cdots, n_k)$ denote the number of ways to distribute $d$ balls into $k$ bins, where the $j$-th bin holds $n_j \ge 0$ balls (the multinomial coefficient). Substituting $z_i = x_i x_i'$ into the multinomial expansion of $(x \cdot x')^d = \left(\sum_i z_i\right)^d$ (the polynomial kernel with $\theta = 0$), we have

$$\kappa_{poly}(x', x) = (x \cdot x')^d = \sum_{n_i \ge 0,\ \sum_i n_i = d} \left(\sqrt{C(d; n_1, \cdots, n_k)}\, x_1^{n_1} \cdots x_k^{n_k}\right)\left(\sqrt{C(d; n_1, \cdots, n_k)}\, (x_1')^{n_1} \cdots (x_k')^{n_k}\right) = \phi_{poly}(x) \cdot \phi_{poly}(x'),$$

where the number of terms with $n_i \ge 0,\ \sum_i n_i = d$ is the dimension of $\phi_{poly}$, $\sqrt{C(d; n_1, \cdots, n_k)}\, x_1^{n_1} \cdots x_k^{n_k}$ is one row in $\phi_{poly}(x)$ and $\sqrt{C(d; n_1, \cdots, n_k)}\, (x_1')^{n_1} \cdots (x_k')^{n_k}$ is one row in $\phi_{poly}(x')$.
The dimension of the vectors $\phi_{poly}(x)$ and $\phi_{poly}(x')$ is $O(k^d)$.
The complexity of computing $\kappa(x', x)$ is $O(k^d)$ using $\phi_{poly}$ vs. $O(k \log d)$ using $(x \cdot x')^d$ directly.
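A quick numerical check of this equivalence, assuming $\theta = 0$ and a square root of the multinomial coefficient folded into each feature as above (plain NumPy and itertools; the test vectors and degree are arbitrary):

```python
import numpy as np
from itertools import product
from math import factorial

def phi_poly(x, d):
    """Explicit degree-d polynomial feature map (theta = 0): one entry
    sqrt(C(d; n1..nk)) * x1^n1 ... xk^nk per tuple n_i >= 0 with sum n_i = d."""
    k = len(x)
    feats = []
    for ns in product(range(d + 1), repeat=k):
        if sum(ns) != d:
            continue
        coeff = factorial(d)
        for n in ns:
            coeff //= factorial(n)          # multinomial coefficient C(d; n1..nk)
        feats.append(np.sqrt(coeff) * np.prod([xi**n for xi, n in zip(x, ns)]))
    return np.array(feats)

x, xp, d = np.array([0.5, -1.0, 2.0]), np.array([1.5, 0.3, -0.7]), 3
lhs = np.dot(x, xp) ** d                       # direct kernel computation
rhs = np.dot(phi_poly(x, d), phi_poly(xp, d))  # explicit O(k^d) feature map
print(np.isclose(lhs, rhs))                    # True
```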
RBF Kernel
A radial basis function (RBF) kernel is defined as
$\kappa_{RBF}(x', x) = \exp(-\gamma ||x - x'||^2) = \frac{1}{e^{\gamma ||x - x'||^2}}$
The hyperparameters of the RBF kernel are the width $\gamma$ and the centres $x'$.
RBF acts as a similarity measure between $x$ and $x'$: $\kappa_{RBF}(x', x) \in (0, 1]$.
If $\gamma ||x - x'||^2 \to \infty$, then $\kappa_{RBF}(x', x) \to 0$. This happens when $x$ and $x'$ are far apart (large distance) or $\gamma$ is very large.
If $\gamma ||x - x'||^2 \to 0$, then $\kappa_{RBF}(x', x) \to 1$. This happens when $x$ and $x'$ are close (small distance) or $\gamma$ is very small.
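These limits are easy to see numerically; in the sketch below, the points and the $\gamma$ values are arbitrary:

```python
import numpy as np

def rbf(x, x_prime, gamma):
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x    = np.array([0.0, 0.0])
near = np.array([0.1, 0.1])
far  = np.array([3.0, 3.0])

for gamma in (0.1, 1.0, 10.0):
    # Nearby points keep the kernel close to 1; a large distance or a large
    # gamma pushes it towards 0.
    print(gamma, rbf(x, near, gamma).round(4), rbf(x, far, gamma).round(6))
```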
Expanding the exponential as a power series factorises the RBF kernel into a dot product of feature vectors, in which $\frac{1}{\sqrt{k!}} \exp(-||x||^2)\, \phi_{poly}(x)$ is the $k$-th row in $\phi_{RBF}(x)$ and $\frac{1}{\sqrt{k!}} \exp(-||x'||^2)\, \phi_{poly}(x')$ is the $k$-th row in $\phi_{RBF}(x')$.
Note that $\phi_{RBF}: \mathbb{R}^k \to \mathbb{R}^\infty$ projects vectors into an infinite-dimensional space, and hence it is not feasible to compute $\kappa_{RBF}(x', x)$ using $\phi_{RBF}$ explicitly.
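The series view can still be sanity-checked numerically by truncating the power series of $\exp(2\gamma\, x \cdot x')$ after finitely many terms; a NumPy sketch (the test vectors, $\gamma$ and the truncation length are arbitrary):

```python
import numpy as np
from math import factorial

def rbf_via_series(x, x_prime, gamma, terms=20):
    """Approximate kappa_RBF via the power series of exp(2*gamma*x.x').

    exp(-g||x-x'||^2) = exp(-g||x||^2) * exp(-g||x'||^2) * sum_k (2g x.x')^k / k!
    Each (x.x')^k term corresponds to a degree-k polynomial feature block, so
    the exact feature map phi_RBF stacks infinitely many such blocks.
    """
    s = sum((2 * gamma * np.dot(x, x_prime)) ** k / factorial(k) for k in range(terms))
    return np.exp(-gamma * np.dot(x, x)) * np.exp(-gamma * np.dot(x_prime, x_prime)) * s

x, xp, gamma = np.array([0.3, -0.8]), np.array([1.0, 0.4]), 0.5
print(rbf_via_series(x, xp, gamma))             # matches the closed form below
print(np.exp(-gamma * np.sum((x - xp) ** 2)))
```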
Linear Model with RBF Kernel
First we choose centres $\mu_1, \cdots, \mu_M$, and then the feature map becomes
$\psi_{RBF}(x) = [1, \kappa_{RBF}(\mu_1, x), \cdots, \kappa_{RBF}(\mu_M, x)]^T$
where we have
$\kappa_{RBF}(x', x) = \exp(-\gamma ||x - x'||^2)$
Then the output from a linear model is
$y = w_0 + \sum_{i=1}^{M} w_i \kappa_{RBF}(\mu_i, x) + \epsilon = w^T \psi_{RBF}(x) + \epsilon$
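A minimal sketch of fitting such a model by ordinary least squares, assuming NumPy only; the toy sine data, the use of the data points as centres and $\gamma = 1$ are illustrative choices:

```python
import numpy as np

def rbf_features(X, centres, gamma):
    """psi_RBF(x) = [1, k_RBF(mu_1, x), ..., k_RBF(mu_M, x)] for each row x of X."""
    sq = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)   # squared distances
    return np.hstack([np.ones((len(X), 1)), np.exp(-gamma * sq)])

# Toy 1-D regression: y = sin(x) + noise, with the data points themselves as centres.
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 30)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

Psi = rbf_features(X, centres=X, gamma=1.0)
w, *_ = np.linalg.lstsq(Psi, y, rcond=None)   # least-squares estimate of w
print(np.abs(Psi @ w - y).mean())             # small training residuals
```

Using the training points themselves as centres gives the model as many basis functions as data points, which ties directly into the hyperparameter discussion that follows.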
When choosing hyperparameters:
The data points themselves can be centres.
For the width parameter $\gamma$, a poor choice may lead to overfitting or underfitting.
If the kernel is too narrow ($\gamma$ is too large), then overfitting occurs.
If the kernel is too wide ($\gamma$ is too small), then underfitting occurs.
The choice of the width parameter is analogous to the choice of degree in polynomial basis expansion (see the sketch below).
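One way to see the narrow/wide behaviour is to inspect the kernel (similarity) matrix of the training points for different widths; in this sketch the grid of points and the $\gamma$ values are arbitrary:

```python
import numpy as np

X = np.linspace(0, 2 * np.pi, 8)[:, None]              # 8 training points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # pairwise squared distances

for gamma in (0.001, 1.0, 100.0):
    K = np.exp(-gamma * sq)
    off_diag = K[~np.eye(len(X), dtype=bool)]
    # Narrow kernels (large gamma): K approaches the identity -- each point is
    # similar only to itself, leaving enough capacity to memorise noise.
    # Wide kernels (small gamma): K approaches the all-ones matrix -- every
    # point looks the same, leaving too little capacity.
    print(gamma, off_diag.min().round(3), off_diag.max().round(3))
```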
Generalization of Machine Learning
The fundamental goal of machine learning is to generalize beyond the examples in the training set. Hence performance on held-out test data is what matters, but data alone is not enough.
Every learner must embody knowledge/assumptions beyond the data.
No free lunch: no learner can beat random guessing over all possible functions to be learned.
Hope: Functions to learn in the real world are not drawn uniformly from the set of all mathematically possible functions.
Reasonable assumptions: similar examples have similar classes; limited dependencies; limited complexity.
Bias-Variance Tradeoff
Having high bias means that we are underfitting, while having high variance means we are overfitting.
To combat overfitting, or high variance, there are several approaches:
Cross-validation.
A regularisation term added to the evaluation function: it penalises models with more structure and favours smaller models with less room to overfit (see the sketch below).
A statistical significance test, such as chi-square, before adding new structure: it decides whether the predictions differ with and without this structure.
Avoiding overfitting may in turn lead to high bias, a.k.a. underfitting.
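As a concrete illustration of the first two remedies, here is a NumPy-only sketch that adds a ridge-style regularisation term and chooses its strength $\lambda$ by plain K-fold cross-validation; the toy cubic data, the degree-9 features and the $\lambda$ grid are all illustrative:

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Closed-form ridge regression: w = (Phi^T Phi + lam * I)^-1 Phi^T y."""
    k = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(k), Phi.T @ y)

def cv_error(Phi, y, lam, folds=5):
    """Plain K-fold cross-validation of the ridge model (illustrative helper)."""
    idx = np.array_split(np.random.default_rng(0).permutation(len(y)), folds)
    errs = []
    for i in range(folds):
        te = idx[i]
        tr = np.concatenate([idx[j] for j in range(folds) if j != i])
        w = ridge_fit(Phi[tr], y[tr], lam)
        errs.append(np.mean((Phi[te] @ w - y[te]) ** 2))
    return np.mean(errs)

# Toy data: a noisy cubic fitted with degree-9 features (room to overfit).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 40)
y = x**3 - x + 0.1 * rng.standard_normal(40)
Phi = np.vander(x, N=10)

for lam in (0.0, 1e-3, 1e-1, 10.0):
    print(lam, cv_error(Phi, y, lam))
```

The $\lambda$ with the lowest cross-validated error is the one that best trades variance (overfitting) against bias (underfitting).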