Linear Regression
Linear Models
Suppose the input is a vector $x \in \mathbb{R}^D$ and the output is $y \in \mathbb{R}$, and we have the datapoints $(x_1, y_1), \dots, (x_N, y_N)$. Then we define a linear model as

$$y = w_0 + x_1 w_1 + \dots + x_D w_D + \epsilon$$

where $x_j$ denotes the $j$-th component of $x$, $w_0$ is called the bias/intercept, and $\epsilon$ is the noise/uncertainty term.
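To make the definition concrete, here is a minimal sketch that draws data from such a linear model; the dimensions, weights, and noise scale are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 3    # number of input features (illustrative choice)
N = 100  # number of datapoints (illustrative choice)

w0 = 0.5                        # bias/intercept
w = np.array([1.0, -2.0, 0.3])  # weights w_1, ..., w_D (assumed values)

X = rng.normal(size=(N, D))          # inputs x_1, ..., x_N, each in R^D
eps = rng.normal(scale=0.1, size=N)  # noise term epsilon

# Linear model: y = w_0 + x_1 w_1 + ... + x_D w_D + eps
y = w0 + X @ w + eps
```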
Learning of Linear Models
Least Squares Objective Function
Assume we have $(x_1, y_1), \dots, (x_N, y_N)$ where $x_i, y_i \in \mathbb{R}$, and the predicted output is $\hat{y} = w_0 + x w$ (without the noise term). We define the least squares objective function as

$$L(w_0, w) = \frac{1}{2N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 = \frac{1}{2N} \sum_{i=1}^{N} (w_0 + w x_i - y_i)^2$$

This objective function is also called the loss function, cost function, energy function, etc. Up to the constant factor $\frac{1}{2N}$, it is the residual sum of squares (RSS). The estimated parameters $(w_0, w)$ are known as the least squares estimate.
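As a quick sanity check, here is a small sketch of this objective; the function name and the sample data are mine, not from the text:

```python
import numpy as np

def least_squares_loss(w0, w, x, y):
    """Least squares objective L(w0, w) = 1/(2N) * sum_i (w0 + w*x_i - y_i)^2."""
    residuals = w0 + w * x - y
    return np.mean(residuals ** 2) / 2

# Tiny example with made-up data
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.9, 4.2, 5.8])
print(least_squares_loss(w0=0.0, w=2.0, x=x, y=y))
```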
One-Dimensional Case
With the objective function defined, we have
$$\frac{\partial L}{\partial w_0} = \frac{1}{N} \sum_{i=1}^{N} (w_0 + w x_i - y_i), \qquad \frac{\partial L}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} (w_0 + w x_i - y_i)\, x_i$$

We obtain the solution for $w_0, w$ by setting the partial derivatives to 0 and solving the resulting system.
$$w_0 + w \cdot \frac{\sum_i x_i}{N} = \frac{\sum_i y_i}{N}, \qquad w_0 \cdot \frac{\sum_i x_i}{N} + w \cdot \frac{\sum_i x_i^2}{N} = \frac{\sum_i x_i y_i}{N}$$

Since we know these four values:

- $\bar{x} = \frac{\sum_i x_i}{N}$.
- $\bar{y} = \frac{\sum_i y_i}{N}$.
- $\widehat{\mathrm{Var}}(x) = \frac{\sum_i x_i^2}{N} - \bar{x}^2$.
- $\widehat{\mathrm{Cov}}(x, y) = \frac{\sum_i x_i y_i}{N} - \bar{x}\,\bar{y}$.
Then we end up with
$$w = \frac{\widehat{\mathrm{Cov}}(x, y)}{\widehat{\mathrm{Var}}(x)}, \qquad w_0 = \bar{y} - w \bar{x}$$
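A short sketch of this one-dimensional closed-form solution; the sample data below is made up for illustration:

```python
import numpy as np

def fit_1d_least_squares(x, y):
    """Closed-form 1D least squares: w = Cov(x, y) / Var(x), w0 = y_bar - w * x_bar."""
    x_bar, y_bar = x.mean(), y.mean()
    var_x = (x ** 2).mean() - x_bar ** 2
    cov_xy = (x * y).mean() - x_bar * y_bar
    w = cov_xy / var_x
    w0 = y_bar - w * x_bar
    return w0, w

# Illustrative data: roughly y = 1 + 2x plus a bit of noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(fit_1d_least_squares(x, y))  # approximately (1.0, 2.0)
```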
General Case

In the general case, we express everything in matrix notation. We have $\hat{y} = Xw$, in which $\hat{y} \in \mathbb{R}^{N \times 1}$, $X \in \mathbb{R}^{N \times (D+1)}$ (the first column of $X$ is all ones, accounting for the bias $w_0$), and $w \in \mathbb{R}^{(D+1) \times 1}$. Then we define the loss function as

$$L(w) = \frac{1}{2N} \sum_{i=1}^{N} (x_i^T w - y_i)^2 = \frac{1}{2N} (Xw - y)^T (Xw - y)$$

Then we can calculate the gradient as

$$\nabla_w L = \frac{1}{N} \left( (X^T X) w - X^T y \right)$$

By setting the gradient $\nabla_w L = 0$, we get the normal equations $(X^T X) w = X^T y$ and hence $w = (X^T X)^{-1} X^T y$.
The predictions made by this model on the data $X$ are given by

$$\hat{y} = Xw = X (X^T X)^{-1} X^T y$$

where $X (X^T X)^{-1} X^T$ is called the hat matrix.
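A minimal NumPy sketch of the general case; the data is synthetic and the function name is mine. It solves the normal equations with `np.linalg.solve` rather than forming the explicit inverse, which is the numerically safer way to evaluate $(X^T X)^{-1} X^T y$:

```python
import numpy as np

def fit_least_squares(X_raw, y):
    """Solve the normal equations (X^T X) w = X^T y, with a prepended bias column."""
    N = X_raw.shape[0]
    X = np.hstack([np.ones((N, 1)), X_raw])  # X in R^{N x (D+1)}
    w = np.linalg.solve(X.T @ X, X.T @ y)    # equivalent to (X^T X)^{-1} X^T y
    return X, w

rng = np.random.default_rng(1)
X_raw = rng.normal(size=(50, 3))
y = 0.5 + X_raw @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.1, size=50)

X, w = fit_least_squares(X_raw, y)
y_hat = X @ w  # predictions; equivalently (hat matrix) @ y
print(w)       # approximately [0.5, 1.0, -2.0, 0.3]
```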
Rank of the Matrix $X^T X$
The above derivation requires $X^T X$ to be invertible. Formally, we have

$$\mathrm{rank}(X^T X) = \mathrm{rank}(X) \le \min\{D+1, N\}$$

Then $X^T X$ is invertible if and only if $\mathrm{rank}(X) = D+1$, i.e. $X$ has full column rank.

If we use one-hot encoding, the encoded columns sum to one in every row ($\sum_i X_i = 1$), which, together with the all-ones bias column, introduces a linear dependency among the columns of $X$ and reduces the rank. In this case, we need to drop some features (e.g. one of the one-hot columns) to restore full rank.
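A small sketch of this rank deficiency, with a categorical setup invented for illustration: a bias column plus a full one-hot encoding makes the columns of $X$ linearly dependent, and dropping one one-hot column restores full column rank.

```python
import numpy as np

# A 3-level categorical feature; all three levels appear in the data
categories = np.array([0, 1, 2, 0, 1, 2, 0, 1])
N = len(categories)

bias = np.ones((N, 1))
one_hot = np.eye(3)[categories]  # full one-hot: each row sums to 1

X_full = np.hstack([bias, one_hot])            # bias column equals the sum of the one-hot columns
X_dropped = np.hstack([bias, one_hot[:, 1:]])  # drop one one-hot column

print(np.linalg.matrix_rank(X_full))     # 3 -> rank-deficient (4 columns)
print(np.linalg.matrix_rank(X_dropped))  # 3 -> full column rank (3 columns)
```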
Complexity Analysis
There are several steps in calculating the weight vector $w = (X^T X)^{-1} X^T y$:

- Calculate $X^T X$: since $X \in \mathbb{R}^{N \times (D+1)}$, this step takes $O(D^2 N)$.
- Calculate the inverse $(X^T X)^{-1}$, which takes $O(D^3)$.
- Calculate $X^T y$, which takes $O(DN)$ because $y \in \mathbb{R}^{N \times 1}$.
- Calculate $(X^T X)^{-1} X^T y$, which takes $O(D^2)$.

Hence the overall complexity is the sum of all these steps, i.e. $O(D^2 N + D^3 + DN + D^2) = O(D^2 N + D^3)$.
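A rough way to see where the time goes when $N \gg D$ (the sizes below are arbitrary and exact timings depend on the BLAS backend):

```python
import time
import numpy as np

rng = np.random.default_rng(3)
N, D = 200_000, 50             # illustrative sizes with N >> D
X = rng.normal(size=(N, D + 1))
y = rng.normal(size=N)

t0 = time.perf_counter()
G = X.T @ X                    # O(D^2 N): dominant term when N >> D
b = X.T @ y                    # O(D N)
t1 = time.perf_counter()
w = np.linalg.solve(G, b)      # O(D^3): cheap here since D is small
t2 = time.perf_counter()

print(f"Gram matrix + X^T y: {t1 - t0:.3f}s, solve: {t2 - t1:.3f}s")
```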
Join of Several Relations

If $X$ is defined by a join of several relations, then the number of rows $N$ may be exponential in the number of relations, i.e. $N = O(M^{\text{number of relations}})$.

If $X$ is sparse, then it can be represented in $O(M)$ space losslessly for acyclic joins, which are common in practice.

In this case, $w$ can still be calculated in $O(D^2 M + D^3)$.
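The join-specific machinery is beyond these notes, but the underlying computational point, that $X^T X$ and $X^T y$ can be assembled from a compact sparse representation without materializing a dense $X$, can be sketched with `scipy.sparse`; the data here is synthetic and not tied to any particular join:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(4)
N, D = 10_000, 20

# A sparse design matrix: forming X^T X costs time proportional to the nonzeros, not N * D
X = sp.random(N, D + 1, density=0.05, random_state=4, format="csr")
y = rng.normal(size=N)

G = (X.T @ X).toarray()    # (D+1) x (D+1) Gram matrix, built from the sparse X
b = X.T @ y                # (D+1)-dimensional vector
w = np.linalg.solve(G, b)  # O(D^3) dense solve on the small system
print(w.shape)             # (21,)
```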