Linear Regression

Linear Models

Suppose the input is a vector $x\in\mathbb{R}^D$ and the output is $y\in\mathbb{R}$, and we have the datapoints $(x_1,y_1),\dots,(x_N,y_N)$. Then we define a linear model as

$$y=w_0+x_1w_1+\cdots+x_Dw_D+\epsilon$$

where $w_0$ is called the bias/intercept and $\epsilon$ is the noise/uncertainty term.
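
As a quick illustration, the model can be simulated in NumPy; the sizes `N` and `D`, the weights `w_true`, and the noise scale below are made up for the example, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 100, 3                                # hypothetical number of datapoints and features
w_true = np.array([0.5, 2.0, -1.0, 3.0])     # [w_0, w_1, ..., w_D], chosen arbitrarily

X = rng.normal(size=(N, D))                  # each row is one input vector x_i
epsilon = rng.normal(scale=0.1, size=N)      # the noise term

# y = w_0 + x_1 w_1 + ... + x_D w_D + epsilon, applied to every datapoint at once
y = w_true[0] + X @ w_true[1:] + epsilon
```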

Learning of Linear Models

Least Squares Objective Function

Assume we have $(x_1,y_1),\dots,(x_N,y_N)$ where $x_i,y_i\in\mathbb{R}$, and the predicted output is $\hat{y}=w_0+x\cdot w$ (without the noise term). We define the least squares objective function as

$$\mathcal{L}(w)=\mathcal{L}(w_0,w)=\frac{1}{2N}\sum_{i=1}^{N}(\hat{y}_i-y_i)^2=\frac{1}{2N}\sum_{i=1}^{N}(w_0+x_i\cdot w-y_i)^2$$

This objective function is also called the loss function, cost function, energy function, etc. Up to the $\frac{1}{2N}$ factor it is the residual sum of squares (RSS). The estimated parameters $(w_0,w)$ are known as the least squares estimate.
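
A direct translation of this objective into NumPy for the one-dimensional case might look like the following sketch (the function name and sample values are illustrative):

```python
import numpy as np

def least_squares_loss(w0, w, x, y):
    """1/(2N) * sum_i (w0 + w * x_i - y_i)^2 for the one-dimensional model."""
    residuals = w0 + w * x - y
    return 0.5 * np.mean(residuals ** 2)

# Toy data and a candidate parameter pair.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])
print(least_squares_loss(1.0, 2.0, x, y))
```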

One-Dimensional Case

With the objective function defined, we have

$$\frac{\partial\mathcal{L}}{\partial w_0}=\frac{1}{N}\sum_{i=1}^N(w_0+wx_i-y_i)$$
$$\frac{\partial\mathcal{L}}{\partial w}=\frac{1}{N}\sum_{i=1}^N(w_0+wx_i-y_i)x_i$$
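
These derivatives can be sanity-checked numerically with central finite differences; the data and the evaluation point in this sketch are arbitrary:

```python
import numpy as np

def loss(w0, w, x, y):
    return 0.5 * np.mean((w0 + w * x - y) ** 2)

def grad(w0, w, x, y):
    r = w0 + w * x - y
    return np.mean(r), np.mean(r * x)        # dL/dw0, dL/dw

rng = np.random.default_rng(1)
x, y = rng.normal(size=20), rng.normal(size=20)
w0, w, eps = 0.3, -1.2, 1e-6

# Central differences approximate each partial derivative.
g0_num = (loss(w0 + eps, w, x, y) - loss(w0 - eps, w, x, y)) / (2 * eps)
gw_num = (loss(w0, w + eps, x, y) - loss(w0, w - eps, x, y)) / (2 * eps)
assert np.allclose(grad(w0, w, x, y), (g0_num, gw_num), atol=1e-5)
```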

We obtain the solution for $w_0,w$ by setting the partial derivatives to $0$ and solving the resulting system.

$$w_0+w\cdot\frac{\sum_ix_i}{N}=\frac{\sum_i y_i}{N}$$
$$w_0\cdot \frac{\sum_ix_i}{N}+w\cdot\frac{\sum_ix_i^2}{N}=\frac{\sum_ix_iy_i}{N}$$

Define the following four sample statistics:

  • $\overline{x}=\frac{\sum_ix_i}{N}$.
  • $\overline{y}=\frac{\sum_iy_i}{N}$.
  • $\hat{Var}(x)=\frac{\sum_ix_i^2}{N}-\overline{x}^2$.
  • $\hat{Cov}(x,y)=\frac{\sum_ix_iy_i}{N}-\overline{x}\,\overline{y}$.

Then we end up with

$$w=\frac{\hat{Cov}(x,y)}{\hat{Var}(x)},\qquad w_0=\overline{y}-w\overline{x}$$
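
A minimal NumPy sketch of this closed-form solution, checked against `np.polyfit` (the data values are made up):

```python
import numpy as np

def fit_1d_least_squares(x, y):
    """Closed form: w = Cov(x, y) / Var(x), w0 = mean(y) - w * mean(x)."""
    x_bar, y_bar = x.mean(), y.mean()
    var_x = np.mean(x ** 2) - x_bar ** 2          # (1/N)-normalized variance, as above
    cov_xy = np.mean(x * y) - x_bar * y_bar       # (1/N)-normalized covariance
    w = cov_xy / var_x
    w0 = y_bar - w * x_bar
    return w0, w

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 2.9, 5.1, 7.0, 9.2])
w0, w = fit_1d_least_squares(x, y)
# np.polyfit with degree 1 solves the same least squares problem (slope first).
assert np.allclose([w, w0], np.polyfit(x, y, deg=1))
```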

General Case

In the general case, we express everything in matrix notation. We have $\hat{y}=Xw$, in which $\hat{y}\in\mathbb{R}^{N\times 1}$, $X\in\mathbb{R}^{N\times(D+1)}$ and $w\in\mathbb{R}^{(D+1)\times 1}$; each row of $X$ is a datapoint $x_i^T$ with a constant $1$ prepended for the intercept. Then we define the loss function as

$$\mathcal{L}(w)=\frac{1}{2N}\sum_{i=1}^{N}(x_i^Tw-y_i)^2=\frac{1}{2N}(Xw-y)^T(Xw-y)$$

Then we can calculate the gradient as

$$\nabla_w\mathcal{L}=\frac{1}{N}\left((X^TX)w-X^Ty\right)$$

Setting the gradient $\nabla_w\mathcal{L}=0$ gives the normal equations $(X^TX)w=X^Ty$, and hence $w=(X^TX)^{-1}X^Ty$ whenever $X^TX$ is invertible.

The predictions made by this model on the data $X$ are given by

$$\hat{y}=Xw=X(X^TX)^{-1}X^Ty$$

where $X(X^TX)^{-1}X^T$ is called the hat matrix.
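
A sketch of this closed-form fit in NumPy; the function name is made up, and `np.linalg.solve` is applied to the normal equations rather than forming the inverse explicitly, which is the numerically preferable choice:

```python
import numpy as np

def fit_linear_regression(X_raw, y):
    """Least squares via the normal equations (X^T X) w = X^T y."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])   # N x (D+1) design matrix with intercept column
    w = np.linalg.solve(X.T @ X, X.T @ y)     # solves (X^T X) w = X^T y
    y_hat = X @ w                             # equivalently, the hat matrix applied to y
    return w, y_hat

# Toy usage with random data.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w, y_hat = fit_linear_regression(X_raw, y)
```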

Rank of the Matrix $X^TX$

The derivation above requires $X^TX$ to be invertible. Formally, the rank satisfies

$$\text{rank}(X^TX)=\text{rank}(X)\leq \min\{D+1, N\}$$

Then $X^TX$ is invertible if and only if $\text{rank}(X)=D+1$, i.e. the columns of $X$ are linearly independent.

If we use one-hot encoding, the encoded columns of each row sum to $1$ ($\sum_i X_i=1$), which duplicates the intercept column and introduces a linear dependency among the columns of $X$, reducing the rank. In this case, we need to drop some features (e.g. one of the encoded columns) to restore full rank.
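
A small illustration of this effect with a made-up three-category feature: the one-hot columns sum to the intercept column, so $X^TX$ loses a rank until one encoded column is dropped.

```python
import numpy as np

categories = np.array([0, 1, 2, 1, 0, 2])
one_hot = np.eye(3)[categories]                            # each row sums to 1
X = np.hstack([np.ones((len(categories), 1)), one_hot])    # intercept column + 3 one-hot columns

print(np.linalg.matrix_rank(X.T @ X))                      # 3 < D + 1 = 4, so X^T X is singular
X_dropped = X[:, :-1]                                      # drop one encoded column
print(np.linalg.matrix_rank(X_dropped.T @ X_dropped))      # 3 = full rank, invertible again
```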

Complexity Analysis

There are several steps in computing the weight vector $w$:

  • Calculate $X^TX$; since $X\in\mathbb{R}^{N\times(D+1)}$, this step takes $O(D^2N)$.
  • Calculate the inverse $(X^TX)^{-1}$, which takes $O(D^3)$.
  • Calculate $X^Ty$, which takes $O(DN)$ because $y\in\mathbb{R}^{N\times 1}$.
  • Calculate $(X^TX)^{-1}X^Ty$, which takes $O(D^2)$.

Hence the overall complexity is the sum of all these steps, i.e. $O(D^2N+D^3+DN+D^2)=O(D^2N+D^3)$.
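
To see which term dominates, here is a back-of-the-envelope count with hypothetical sizes $N=10^6$ and $D=100$ (constant factors and the $+1$ in $D+1$ are ignored):

```python
N, D = 1_000_000, 100   # hypothetical dataset size and dimensionality

steps = {
    "X^T X":                D ** 2 * N,    # ~1e10 operations, dominates
    "(X^T X)^{-1}":         D ** 3,        # ~1e6
    "X^T y":                D * N,         # ~1e8
    "(X^T X)^{-1} X^T y":   D ** 2,        # ~1e4
}
for name, ops in steps.items():
    print(f"{name:>22}: ~{ops:.1e} operations")
```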

Join of Several Relations

If $X$ is defined by a join of several relations, then the number of rows $N$ may be exponential in the number of relations, i.e. $N=O(M^{\text{number of relations}})$, where $M$ is the number of rows per relation.

If $X$ is sparse, it can be represented losslessly in $O(M)$ space for acyclic joins, which are common in practice.

In this case, $w$ can still be calculated in $O(D^2M+D^3)$.