
Expansion and Kernel Methods

We can perform expansions into higher dimensions. For example, we can expand the simple linear feature map $\psi(x)=[1,x_1,x_2]^T$ into the quadratic form $\psi(x)=[1,x_1,x_2,x_1^2,x_2^2,x_1x_2]^T$. By doing so, we are still fitting linear models, but with more, higher-dimensional features.

Using degree-$d$ polynomials in $k$ dimensions results in $O(k^d)$ features.
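
As a minimal sketch of this idea (the function name and the use of NumPy are my own choices, not from the notes), the quadratic expansion above can be written out explicitly:

```python
import numpy as np

def quadratic_features(x):
    """Expand a 2-D input [x1, x2] into the quadratic feature map
    [1, x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

# A linear model in the expanded features is still linear in the weights w,
# even though it is quadratic in the original inputs.
x = np.array([2.0, 3.0])
w = np.random.randn(6)
y_hat = w @ quadratic_features(x)
```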

We can use kernels as the features. For some expansions, a kernel computes the dot product of the feature vectors, i.e. $\kappa(x',x):\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, $\kappa(x',x)=\phi(x')\cdot\phi(x)$.

Some examples of kernels:

  • Polynomial kernel: $\kappa_{\text{poly}}(x',x)=(x\cdot x'+\theta)^{d}$.
  • Radial Basis Function (RBF) kernel: $\kappa_{\text{RBF}}(x',x)=\exp(-\gamma\|x-x'\|^2)$.
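
As a small sketch of the two kernels above (plain NumPy, with names and default values of my own choosing), both can be evaluated directly from the inputs without ever forming $\phi$:

```python
import numpy as np

def poly_kernel(x, x_prime, theta=1.0, d=2):
    """Polynomial kernel (x . x' + theta)^d, computed without the feature map."""
    return (x @ x_prime + theta) ** d

def rbf_kernel(x, x_prime, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
x_prime = np.array([0.5, -1.0])
print(poly_kernel(x, x_prime), rbf_kernel(x, x_prime))
```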

Polynomial Kernel

The polynomial kernel is defined as

$$\kappa_{\text{poly}}(x',x)=(x\cdot x'+\theta)^{d}$$

Since we have the multinomial expansion

$$(z_1+\cdots+z_k)^d=\sum_{n_i\geq0,\ \sum_i n_i=d} \frac{d!}{n_1!\cdots n_k!}z_1^{n_1}\cdots z_k^{n_k}$$

Let $C(d;n_1,\cdots,n_k)=\frac{d!}{n_1!\cdots n_k!}$ denote the multinomial coefficient, i.e. the number of ways to distribute $d$ balls into $k$ bins, where the $j$-th bin holds $n_j\geq 0$ balls. Substituting $z_i=x_ix_i'$ into the expansion above, we have

$$(x\cdot x')^d=\sum_{n_i\geq0,\ \sum_i n_i=d} \Big(\sqrt{C(d;n_1,\cdots,n_k)}\,x_1^{n_1}\cdots x_k^{n_k}\Big)\Big(\sqrt{C(d;n_1,\cdots,n_k)}\,(x_1')^{n_1}\cdots(x_k')^{n_k}\Big)$$

where the number of terms in the sum is the dimension of $\phi_{\text{poly}}$, each term $\sqrt{C(d;n_1,\cdots,n_k)}\,x_1^{n_1}\cdots x_k^{n_k}$ is one entry of $\phi_{\text{poly}}(x)$, and the corresponding $\sqrt{C(d;n_1,\cdots,n_k)}\,(x_1')^{n_1}\cdots(x_k')^{n_k}$ is one entry of $\phi_{\text{poly}}(x')$.

The dimension of the vectors $\phi_{\text{poly}}(x)$ and $\phi_{\text{poly}}(x')$ is $O(k^d)$.

The complexity of computing $\kappa(x',x)$ is $O(k^d)$ using $\phi_{\text{poly}}$ vs. $O(k+\log d)$ using $(x\cdot x'+\theta)^d$ directly (an $O(k)$ dot product followed by exponentiation by squaring).
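
As a concrete instance, for $k=2$ and $d=2$ (with $\theta=0$) the explicit feature map is $\phi_{\text{poly}}(x)=[x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2]^T$. The sketch below (my own illustrative code, not from the notes) checks numerically that the dot product of the explicit features matches the kernel computed directly:

```python
import numpy as np

def phi_poly_2d_deg2(x):
    """Explicit feature map for the homogeneous polynomial kernel (x . x')^2 in 2-D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.5, -0.7])
x_prime = np.array([0.3, 2.0])

via_features = phi_poly_2d_deg2(x) @ phi_poly_2d_deg2(x_prime)   # O(k^d) features
via_kernel = (x @ x_prime) ** 2                                   # O(k) work

assert np.isclose(via_features, via_kernel)
```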

RBF Kernel

A radial basis function (RBF) kernel is defined as

$$\kappa_{\text{RBF}}(x',x)=\exp(-\gamma\|x-x'\|^2)=\frac{1}{e^{\gamma\|x-x'\|^2}}$$

The hyperparameters in the RBF kernel are the width $\gamma$ and the centres $x'$.

The RBF kernel acts as a similarity measure between $x$ and $x'$: $\kappa_{\text{RBF}}(x',x)\in(0,1]$.

  • If $\gamma\|x-x'\|^2\to\infty$, then $\kappa_{\text{RBF}}(x',x)\to0$. This happens when $x$ and $x'$ are far apart (large distance) or $\gamma$ is very large.
  • If $\gamma\|x-x'\|^2\to0$, then $\kappa_{\text{RBF}}(x',x)\to1$. This happens when $x$ and $x'$ are close (small distance) or $\gamma$ is very small.
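
A quick numeric illustration of these two limits (the points and $\gamma$ values below are just examples I picked):

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma):
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([0.0, 0.0])
near, far = np.array([0.1, 0.0]), np.array([10.0, 0.0])

print(rbf_kernel(x, near, gamma=1.0))   # close to 1: nearby points are "similar"
print(rbf_kernel(x, far, gamma=1.0))    # close to 0: distant points are "dissimilar"
print(rbf_kernel(x, near, gamma=1e3))   # a very large gamma pushes even nearby points toward 0
```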

Since we have

$$\|x-x'\|^2=\langle x-x',x-x'\rangle=\langle x,x\rangle+\langle x',x'\rangle-2\langle x,x'\rangle=\|x\|^2+\|x'\|^2-2\,x\cdot x'$$

and the Taylor expansion

$$\exp(z)=\sum_{k=0}^\infty\frac{z^k}{k!}, \qquad \text{which gives} \qquad \exp(x\cdot x')=\sum_{k=0}^{\infty}\frac{1}{k!}(x\cdot x')^k$$

Without loss of generality, assume $\gamma=1$; then we have

$$\exp(-\|x-x'\|^2)=e^{-\|x\|^2}e^{-\|x'\|^2}e^{2x\cdot x'}=\sum_{k=0}^{\infty}\Big(\sqrt{\tfrac{2^k}{k!}}\,e^{-\|x\|^2}\phi_{\text{poly}}^{(k)}(x)\Big)\cdot\Big(\sqrt{\tfrac{2^k}{k!}}\,e^{-\|x'\|^2}\phi_{\text{poly}}^{(k)}(x')\Big)$$

in which $\phi_{\text{poly}}^{(k)}$ is the feature map of the homogeneous polynomial kernel $(x\cdot x')^k$ from the previous section, $\sqrt{\tfrac{2^k}{k!}}\,e^{-\|x\|^2}\phi_{\text{poly}}^{(k)}(x)$ is the $k$-th block of coordinates in $\phi_{\text{RBF}}(x)$, and $\sqrt{\tfrac{2^k}{k!}}\,e^{-\|x'\|^2}\phi_{\text{poly}}^{(k)}(x')$ is the $k$-th block in $\phi_{\text{RBF}}(x')$.

Note that $\phi_{\text{RBF}}:\mathbb{R}^k\to\mathbb{R}^{\infty}$ projects vectors into an infinite-dimensional space, and hence it is not feasible to compute $\kappa_{\text{RBF}}(x',x)$ using $\phi_{\text{RBF}}$.
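
As a sanity check on the expansion (my own sketch, assuming $\gamma=1$ and truncating the infinite series at $K$ terms), the factorised series above converges to the directly computed RBF kernel:

```python
import math
import numpy as np

x = np.array([0.4, -0.2])
x_prime = np.array([0.1, 0.3])

# Direct evaluation of the RBF kernel with gamma = 1.
direct = np.exp(-np.sum((x - x_prime) ** 2))

# Truncated series exp(-||x||^2) exp(-||x'||^2) sum_k (2 x.x')^k / k!,
# i.e. the dot product of the truncated infinite feature maps.
K = 20
dot = float(x @ x_prime)
series = np.exp(-x @ x) * np.exp(-x_prime @ x_prime) * sum(
    (2.0 * dot) ** k / math.factorial(k) for k in range(K)
)

assert np.isclose(direct, series)
```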

Linear Model with RBF Kernel

First we choose centres $\mu_1,\cdots,\mu_M$, and then the feature map becomes

$$\psi_{\text{RBF}}(x)=[1,\kappa_{\text{RBF}}(\mu_1,x),\cdots,\kappa_{\text{RBF}}(\mu_M,x)]^T$$

where we have

$$\kappa_{\text{RBF}}(x',x)=\exp(-\gamma\|x-x'\|^2)$$

Then the output from a linear model is

$$y=w_0+\sum_{i=1}^M w_i\,\kappa_{\text{RBF}}(\mu_i,x)+\epsilon=w\cdot\psi_{\text{RBF}}(x)+\epsilon$$
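
A minimal sketch of fitting such a model (my own illustrative code: plain NumPy least squares, the data points used as centres, and a $\gamma$ chosen arbitrarily):

```python
import numpy as np

def rbf_features(X, centres, gamma):
    """psi_RBF(x) = [1, k(mu_1, x), ..., k(mu_M, x)] for each row x of X."""
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)                      # shape (N, M)
    return np.hstack([np.ones((X.shape[0], 1)), K])    # prepend the bias feature

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

centres, gamma = X, 1.0                      # use the data points themselves as centres
Psi = rbf_features(X, centres, gamma)
w, *_ = np.linalg.lstsq(Psi, y, rcond=None)  # least-squares fit of the linear model
y_hat = Psi @ w
```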

When choosing hyperparameters:

  • The data points themselves can be used as centres.
  • A poorly chosen width parameter can lead to overfitting or underfitting (compare the widths in the sketch below):
    • If the kernel is too narrow ($\gamma$ is too large), then overfitting occurs.
    • If the kernel is too wide ($\gamma$ is too small), then underfitting occurs.

The choice of the width parameter is similar to the choice of the degree in polynomial basis expansion.
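
To illustrate the effect of the width (a sketch under my own choices of $\gamma$ values and train/held-out split, reusing the RBF features from above), compare training and held-out error for a too-wide, a moderate, and a too-narrow kernel:

```python
import numpy as np

def rbf_features(X, centres, gamma):
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-gamma * sq_dists)])

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(80)
X_tr, y_tr, X_te, y_te = X[:60], y[:60], X[60:], y[60:]

for gamma in (0.01, 1.0, 1000.0):          # too wide, moderate, too narrow
    Psi_tr = rbf_features(X_tr, X_tr, gamma)
    w, *_ = np.linalg.lstsq(Psi_tr, y_tr, rcond=None)
    Psi_te = rbf_features(X_te, X_tr, gamma)
    mse_tr = np.mean((Psi_tr @ w - y_tr) ** 2)
    mse_te = np.mean((Psi_te @ w - y_te) ** 2)
    print(f"gamma={gamma}: train MSE={mse_tr:.3f}, held-out MSE={mse_te:.3f}")

# A moderate gamma typically gives the best held-out error; a very large gamma
# tends to drive the train error toward zero while the held-out error grows
# (overfitting), and a very small gamma gives an overly smooth fit (underfitting).
```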

Generalization in Machine Learning

The fundamental goal of machine learning is to generalize beyond the examples in the training set. Hence performance on test data is what matters, but data alone is not enough.

  • Every learner must embody knowledge/assumptions beyond the data.
  • No free lunch: no learner can beat random guessing over all possible functions to be learned.
  • Hope: Functions to learn in the real world are not drawn uniformly from the set of all mathematically possible functions.
  • Reasonable assumptions: similar examples have similar classes; limited dependencies; limited complexity.

Bias-Variance Tradeoff

Having high bias means that we are underfitting, while having high variance means we are overfitting.

To combat overfitting, or high variance, there are several approaches:

  • Cross-validation.
  • Adding a regularisation term to the evaluation function (a minimal sketch follows this list). It penalises models with more structure and favours smaller models with less room to overfit.
  • Statistical significance tests, such as a chi-square test before adding new structure, to decide whether the predictions differ with and without that structure.
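
As a minimal sketch of the regularisation idea (my own illustration, reusing the RBF features from above and adding an L2 penalty $\lambda\|w\|^2$, i.e. ridge regression; the $\lambda$ and $\gamma$ values are arbitrary):

```python
import numpy as np

def rbf_features(X, centres, gamma):
    sq_dists = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
    return np.hstack([np.ones((X.shape[0], 1)), np.exp(-gamma * sq_dists)])

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

Psi = rbf_features(X, X, gamma=10.0)   # deliberately narrow kernels
lam = 0.1                              # regularisation strength

# Ridge solution: w = (Psi^T Psi + lam * I)^{-1} Psi^T y.
# The penalty shrinks the weights and smooths an otherwise overfitted fit.
w = np.linalg.solve(Psi.T @ Psi + lam * np.eye(Psi.shape[1]), Psi.T @ y)
y_hat = Psi @ w
```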

Avoiding overfitting too aggressively may lead to high bias, i.e. underfitting.