Machine Learning by Andrew Ng

Source: Coursera

Introduction | Linear regression with one variable | Linear Algebra
Linear regression with multiple variables | Octave/Matlab
Logistics regression | Regularization
Neutral networks | Model | Forward propagation
Neutral networks | Cost function | Back propagation
ML | Advice | Fix | Design
Unsupervised | K-Means | PCA
Anomaly Detection | Gussian dist.
Learning with Large Datasets

Have a machine learn to do it by itself 

(without explicitly programmed)
  • big data mining: webclick, biology
  • applications cannot program by hand: auto-pilot, Natural Language Processing, Computer Vision 
  • self-customization program: recommendation
  • understand human learning

ML definition

  • Task, T
  • Experience, E
  • Performance, P
learn from E > respect to T > performance measure P
perform on T > measure P > improve E
practical way for applying learning algorithm

Supervised Learning

RIGHT answer given (i.e. training set)
  • regression: continuous
  • classification: discrete
1 to infinite number of features | attributes

Unsupervised Learning

no labels > structure of data / pattern
e.g. google news, genes, computing clusters, social network analysis, market segment,
  • Clustering
  • Non-clustering
high-level programming language / prototyping: GNU Octave software

Linear regression with one variable

Model and Cost function

  • Model: Training set > Learning algorithm > Hypothesis, h, predictor for the corresponding value of y given a x
  • Cost function (Square error function) for regression: minimize the error/difference between h(x) with input x's & actual output, y's
  • Goal: minimize cost function J(theta) of parameter theta
  • Contour plot/figure > Gradient Descent: linear regression with one variable
y^=hθ(x)=θ0+θ1x
J(θ0,θ1)=12mi=1m(y^iyi)2=12mi=1m(hθ(xi)yi)2

Gradient Descent algorithm

(repeat until convergence with simultaneous update on theta j)
θj:=θjαθjJ(θ0,θ1)
  • a := b assignment
  • a = b true assertion
  • alpha: learning rate
    • small, slow to descending
    • large to overshoot the minimum
  • derivative term (e.g. slop with one theta, slop = 0, reach at a local optima)

Gradient Descent algorithm for linear regression

(simultaneous update on theta 0 and theta 1)
repeat until convergence: {θ0:=θ1:=}θ0α1mi=1m(hθ(xi)yi)θ1α1mi=1m((hθ(xi)yi)xi)
  • convex function: bow shape, only one global optima
  • batch: use all training set/example

Linear regression as Matrix-Vector Multiplication

Predictor (m,1) = Data matrix (m,n) * parameter vector (n,1)

Multiple competing hypothesis as Matrix Matrix Multiplication

Predictor (m,o) = Data matrix (m,n) * Parameter matrix (n,o)

  • parallelism
  • SIMD stands for "Single Instruction Multiple Data

Identity matrix, I, I nxn
Inverse matrix
Singular / degenerate matrix
Transpose

Multiple features

  • multivariate linear regression

hθ(x)=[θ0θ1...θn]x0x1xn=θTx
  • Gradient Descent for Multiple Variables

}repeat until convergence:{θj:=θjα1mi=1m(hθ(x(i))y(i))x(i)jfor j := 0...n
  • Feature Scaling: , less skewed contour of the gradient descent
    • e.g. -1 to 1
    • mean normalization
  • Learning rate: 
    • plot cost function J(theta) v.s. # of iterations 
    • too small: slow convergence
    • too large: may not converge
    • try 3 times of alpha
  • Polynomial Regression
    • may simplify as multivariate linear
    • feature scaling to simplified multivariate linear  

Normal Equation

  • feature scaling is not necessary
  • no need to choose learning rate, alpha
  • don't need to iterate
  • need to compute inverse matrix, pinv  > close to n(3) cube
  • slow when # of features, n is large, e.g. n >= 10000
  • Noninvertibility
    • XtransposeX: if non-invertible (singular)
    • Octav: pinv (pseudo inverse), inv
    • do the right thing
      1. redundant features (linearly dependent)
      2. too many features (e.g. m<=n)
Prototyping language
  • Octave
  • MATLAB
  • Python
  • NumPy
  • R
A*B does a matrix multiply
A.*B does an element-wise multiplication
h(x) = theta' * x" vs. "h(x) = X * theta


Classification | Logistics regression

binary classification problem
Logistic regression model
  • Hypothesis: Sigmoid function / Logistic function, g(z)
hθ(x)=g(θTx)z=θTxg(z)=11+ez

  • Decision boundary: property of hypothesis parameter, not training set
θTx0y=1θTx<0y=0
  • Cost function: convex
J(θ)=m1i=1m[y(i) log(hθ(x(i)))+(1y(i)) log(1hθ(x(i)))]
h=g(Xθ)
J(θ)=1m(yTlog(h)(1y)Tlog(1h))



  • Gradient Descent
θ:=θmαXT(g(Xθ)y)
  • Advanced optimization: Conjugate gradient, BFGS, L-BFGS (fast & automatic alpha , but complex)
  • function (jVal, gradient) = costFunction(theta)
  • [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
  • Multiclass Classification

Problem solving

  • under-fit: high biased
  • over-fit: high variance, failed to generate new example 
    • Option1: Reduce numbers of feature
      • manually
      • model selection algorithm
    • Option2: Regularization
      • reduce the magnitude / value of features

Regularization

reduce over-fitting
  • cost function, regularizing all of parameters, theta
  • small value for parameter, theta: simpler hypothesis
  • lambda too large
    • Algorithm works fine. cannot hurt it
    • Algorithm failed to eliminate over-fitting. 
    • Algorithm results in under-fitting. (failed to fit even the training example well)
    • Gradient decent will fail to converge
  • Adding a new feature to the model always results in equal or better performance NOT on the training set
  • Adding many new features to the model makes it more likely to overfit the training set.

Regularized Linear Regression

  • Gradient Descent
  • Normal Equation

Regularized Logistic Regression

  • Cost Function

Neutral network

non-linear hypothesis
sensor > cortex
data > algorithm
neuron in the brain > spikes, pulse of electricity

  • Input wire: Dendrite > X
  • Nucleus (cell body) > Activation function, Neural network (Input, Hidden, Output layer)
    • a(i) j: activation of unit i in layer j
  • Output wire: Axon > h(x) of parameters, weight
x0x1x2x3a(2)1a(2)2a(2)3hθ(x)







Forward propagation

  • vectorization
  • the network architecture, learn from its own features

z(j)=Θ(j1)a(j1)



Θ(j1) with dimensions s_j\times (n+1)
vector a^{(j-1)} with height (n+1)


a(j)=g(z(j))

hΘ(x)=a(j+1)=g(z(j+1))

Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression


Examples
  • AND, OR, NOT, NOR: logistics regression like
  • Multiclass classification: multiple output units, one-vs-all

Neutral network (Classification)

  • L: # of layers
  • S(L): # of units (excluding bias unit)
  • Binary classification: S(L) = 1
  • Multi-class classification: S(L) = K (K >=3)
  • Cost function (Logistics regression)
    • K = k th units
    • i = m th output
    • Theta (0) is NOT summed, i.e. i=1:S(l)
J(Θ)=1mi=1mk=1K[y(i)klog((hΘ(x(i)))k)+(1y(i)k)log(1(hΘ(x(i)))k)]+λ2ml=1L1i=1slj=1sl+1(Θ(l)j,i)2

Backpropagation

Gradient: compute cost, partial derivative cost
delta(j,l): ERROR of unit/node j of layer l

For training example t =1 to m:
  1. Set a^{(1)} := x^{(t)}
  2. Perform forward propagation to compute a^{(l)} for l=2,3,…,L

3. Using y^{(t)}, compute \delta^{(L)} = a^{(L)} - y^{(t)}

4. Compute \delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)} using δ(l)=((Θ(l))Tδ(l+1)) . a(l) . (1a(l))
The g-prime derivative terms can also be written out as:
g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})
5. \Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)} or with vectorization, \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T
Hence we update our new \Delta matrix.
  • D(l)i,j:=1m(Δ(l)i,j+λΘ(l)i,j), if j≠0.
  • D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j} If j=0

Gradient Checking

  • gradApprox (numerical) ~~ DVec (back propagation)

Random initialization

  • zero initialization > identical hidden units (it doesn't work)
  • symmetry breaking

Practical neural network algorithm

  1. pick a network architecture
    • # of output units = number of classes
    • # of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
    • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
    • # of input units = dimension of features x^{(i)}x (i)
  2. Randomly initialize the weights
  3. Implement forward propagation to get hΘ(x(i)) for any x^{(i)}x (i)
  4. Implement the cost function
  5. Implement backpropagation to compute partial derivatives
  6. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking
  7. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta. (keep in mind that J(Θ) is NOT convex and thus we can end up in a LOCAL minimum instead.)
Autonomous drive

Advice & Guideline

debug a learning algorithm > Machine learning diagnostic
  • overfitting
  • underfitting
Model selection (degree of polynomial), d
  • Train: 60%
  • Cross Validation: 20%
  • Test: 20%
Bias & Variance, J

  • High bias (underfitting): high J(train); high J(cv) ~ J(train)
  • High variance (overfitting): low J(train); J(cv) >> J(train)


Regularization, lambda
Learning curve, m


Fruitful Fixes
  • Getting more training examples: high variance
  • Trying smaller sets of features: high variance
  • Adding features: high bias
  • Adding polynomial features: high bias
  • Decreasing λ: high bias
  • Increasing λ: high variance

Design

Building spam classifier

  • Collect lots of data, high frequent occurring words (1000-5000)
  • Develop sophisticated features (using email header data in spam emails)
  • Develop algorithms to process your input in different ways (recognizing misspellings in spam).
  • Error analysis
    • quick & dirty first (simple algorithm, one day)
    • learning curve
    • manually examine the example of error made in the validation
  • Porter Stemming

Handling skewed data (rare class)

  • Precision: true positive / (true positive + false positive
  • Recall: true positive / (true positive + true negative)
  • Trading off precision and recall: F score = 2 PR / (P+R)
Using large dataset

Support Vector Machine

SVM hypothesis

  • cost1(z): if y=1, z >= 1
  • cost0(z): if y=0, z <= -1
  • Optimize objective of logistic regression: A+lambda * B > C*A+B

Large margin classification

  • SVM decision boundary > margin
  • mathematics behind: ||u||, length of the vector, u
  • linear decision boundary
  • non-linear decision boundary

Kernel

  • compute similarity(Gaussian kernel): features based on land marks, l1, l2, l3
  • x ~ landmark 1, feature 1 = 1
  • choosing landmarks: X > f = similarity (x,l)
  • SVM with kernels: off-the -shelf software
  • C = (1/lambda)
    • large C > small lambda > high bias (underfitting)
    • small C > large lambda > high variance (overfitting)
  • Sigma square: threshold

SVM  in practice

  1. parameter theta: SVM package ( libSVM, liblinear, ...)
  2. Need to specify
    • parameter C
    • kernel, similarity: Gaussian, Polynomial, String, Chi-square, Histogram intersection
    • do perform feature scaling first
    • satisfy "Mercer's Thereom", converge
  3. multi-class classification
  4. logistics regression vs. SVM
    • # features, n >> # example, m : logistics regression or SVM w/o kernel
    • small n, intermediate m : Gaussian kernel
    • small n, large m : logistics regression or SVM w/o kernel
  5. NN: work well, but slow to train

WK8: Unsupervised Learning

Applications of Clustering
  • Market segment
  • social network analysis
  • organize datacenter cluster
  • astronomical data analysis

Principle Component Analysis (PCA)

  • find a direction/vector to minimize the projection error
  • reduce from n-dimension to k-dimension
  • NOT a linear regression
Reduce from n-dimension to k-dimension
  1. Data preproccessing
    • mean normalization
    • scale features by n-2 or sigma
  2. Covariance matrix, capital sigma
    • Sigma = (1/m)*X'*X
    • [u, s, v] = svd(Sigma): single value decomposition, n x n matrix
    • eig(sigma) 
  3. from [u, s, v] = svd(Sigma)
    • Ureduce = u[:,1:k]
    • z = Ureduce'*x
    • choosing k (# of principle components): s(1:k) / s(1:n) > 99%
Advice: apply PCA only to the training dataset (mapping Xtraining to z)
Application
  • Compression: reduce memory/disk to store data 
  • Compression: speed up the algorithm
  • Visualization: K=2 or 3
Bad use o PCA
  • to prevent overfitting (reduce features): should use regularization instead
  • apply PCA, without trying original/raw data: check if storage shortage/slow performance issue

Anomaly Detection

Gaussian (Normal) distribution
  • mu, sigma
Algorithm evaluation (skewed class)
  • true positive, false positive, true negative, false negative
  • precision / recall
  • F- score
vs. supervised training
  • small number (0-20) of positive examples
  • different types of (future) anomalies
Choose Gaussian features
  • hist(x,bin)
  • log transform
  • exponent
  • Error analysis
Multivariate Gaussian Distribution

Large scale

Online learning

Stochastic gradient descent

  • randomly shuffle datasets
  • during learning, check cost(theta,x,y), before update theta using x, y
  • plot average cost over 1000(say) iterations
  • check not-so-noisy converge
  • use smaller alpha (typically constant) , if cost increasing 

mini-batch gradient descent

  • batch, instead one example per iteration

Map-reduce & Data parallelism

  • split training set
  • parallel machines / multi-core machines
  • sums of the function over training set

Summary

Supervised Learning

Linear regression | Logistics regression | Neurel network | SVMs

Unsupervised Learning

K-mean | PCA | Anomaly detection

Special application

Recommender system | large scale ML

Advice on ML buildup

Bias-Variance | Regularization | Decide what to do next | Evaluation of learning algorithm | Learning curve | Error analysis | Ceiling analysis

Comments

Popular Posts