Machine Learning by Andrew Ng

11/18/2019

Machine Learning by Andrew Ng

Have a machine learn to do it by itself

(without explicitly programmed)

big data mining: webclick, biology
applications cannot program by hand: auto-pilot, Natural Language Processing, Computer Vision
self-customization program: recommendation
understand human learning

ML definition

Task, T
Experience, E
Performance, P

learn from E > respect to T > performance measure P
perform on T > measure P > improve E

practical way for applying learning algorithm

Supervised Learning

RIGHT answer given (i.e. training set)

regression: continuous
classification: discrete

1 to infinite number of features | attributes

Unsupervised Learning

no labels > structure of data / pattern
e.g. google news, genes, computing clusters, social network analysis, market segment,

Clustering
Non-clustering

high-level programming language / prototyping: GNU Octave software

Linear regression with one variable

Model and Cost function

Model: Training set > Learning algorithm > Hypothesis, h, predictor for the corresponding value of y given a x
Cost function (Square error function) for regression: minimize the error/difference between h(x) with input x's & actual output, y's
Goal: minimize cost function J(theta) of parameter theta
Contour plot/figure > Gradient Descent: linear regression with one variable

y^=hθ(x)=θ0+θ1x

J(θ0,θ1)=12m∑i=1m(y^i−yi)2=12m∑i=1m(hθ(xi)−yi)2

Gradient Descent algorithm

(repeat until convergence with simultaneous update on theta j)
θj:=θj−α∂∂θjJ(θ0,θ1)

a := b assignment
a = b true assertion
alpha: learning rate

small, slow to descending
large to overshoot the minimum

derivative term (e.g. slop with one theta, slop = 0, reach at a local optima)

Gradient Descent algorithm for linear regression

(simultaneous update on theta 0 and theta 1)
repeat until convergence: {θ0:=θ1:=}θ0−α1m∑i=1m(hθ(xi)−yi)θ1−α1m∑i=1m((hθ(xi)−yi)xi)

convex function: bow shape, only one global optima
batch: use all training set/example

Linear regression with multiple variables

Linear regression as Matrix-Vector Multiplication

Predictor (m,1) = Data matrix (m,n) * parameter vector (n,1)

Multiple competing hypothesis as Matrix Matrix Multiplication

Predictor (m,o) = Data matrix (m,n) * Parameter matrix (n,o)

parallelism
SIMD stands for "Single Instruction Multiple Data

Identity matrix, I, I nxn
Inverse matrix
Singular / degenerate matrix
Transpose

Multiple features

multivariate linear regression

hθ(x)=[θ0θ1...θn]⎡⎣⎢⎢⎢x0x1⋮xn⎤⎦⎥⎥⎥=θTx

Gradient Descent for Multiple Variables

}repeat until convergence:{θj:=θj−α1m∑i=1m(hθ(x(i))−y(i))⋅x(i)jfor j := 0...n

Feature Scaling: , less skewed contour of the gradient descent

e.g. -1 to 1
mean normalization

Learning rate:

plot cost function J(theta) v.s. # of iterations
too small: slow convergence
too large: may not converge
try 3 times of alpha

Polynomial Regression

may simplify as multivariate linear
feature scaling to simplified multivariate linear

Normal Equation

feature scaling is not necessary
no need to choose learning rate, alpha
don't need to iterate
need to compute inverse matrix, pinv > close to n(3) cube
slow when # of features, n is large, e.g. n >= 10000
Noninvertibility

XtransposeX: if non-invertible (singular)
Octav: pinv (pseudo inverse), inv
do the right thing

redundant features (linearly dependent)
too many features (e.g. m<=n)

Prototyping language

Octave
MATLAB
Python
NumPy
R

A*B does a matrix multiply

A.*B does an element-wise multiplication

h(x) = theta' * x" vs. "h(x) = X * theta

Classification | Logistics regression

binary classification problem
Logistic regression model

Hypothesis: Sigmoid function / Logistic function, g(z)

hθ(x)=g(θTx)z=θTxg(z)=11+e−z

Decision boundary: property of hypothesis parameter, not training set

θTx≥0⇒y=1θTx<0⇒y=0

Cost function: convex

J(θ)=−m1∑i=1m[y(i) log(hθ(x(i)))+(1−y(i)) log(1−hθ(x(i)))]
h=g(Xθ)
J(θ)=1m⋅(−yTlog(h)−(1−y)Tlog(1−h))

Gradient Descent

θ:=θ−mαXT(g(Xθ)−y)

Advanced optimization: Conjugate gradient, BFGS, L-BFGS (fast & automatic alpha , but complex)
function (jVal, gradient) = costFunction(theta)
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
Multiclass Classification

Problem solving

under-fit: high biased
over-fit: high variance, failed to generate new example

Option1: Reduce numbers of feature

manually
model selection algorithm

Option2: Regularization

reduce the magnitude / value of features

Regularization

reduce over-fitting

cost function, regularizing all of parameters, theta
small value for parameter, theta: simpler hypothesis
lambda too large

Algorithm works fine. cannot hurt it
Algorithm failed to eliminate over-fitting.
Algorithm results in under-fitting. (failed to fit even the training example well)
Gradient decent will fail to converge

Adding a new feature to the model always results in equal or better performance NOT on the training set
Adding many new features to the model makes it more likely to overfit the training set.

Regularized Linear Regression

Gradient Descent
Normal Equation

Regularized Logistic Regression

Cost Function

Neutral network

non-linear hypothesis

sensor > cortex
data > algorithm

neuron in the brain > spikes, pulse of electricity

Input wire: Dendrite > X
Nucleus (cell body) > Activation function, Neural network (Input, Hidden, Output layer)

a(i) j: activation of unit i in layer j

Output wire: Axon > h(x) of parameters, weight

⎡⎣⎢⎢x0x1x2x3⎤⎦⎥⎥→⎡⎣⎢⎢⎢a(2)1a(2)2a(2)3⎤⎦⎥⎥⎥→hθ(x)

Forward propagation

vectorization
the network architecture, learn from its own features

z(j)=Θ(j−1)a(j−1)

Θ(j−1) with dimensions

s_j\times (n+1)

vector

a^{(j-1)}

with height (n+1)

a(j)=g(z(j))

hΘ(x)=a(j+1)=g(z(j+1))

Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression

Examples

AND, OR, NOT, NOR: logistics regression like
Multiclass classification: multiple output units, one-vs-all

Neutral network (Classification)

L: # of layers
S(L): # of units (excluding bias unit)
Binary classification: S(L) = 1
Multi-class classification: S(L) = K (K >=3)
Cost function (Logistics regression)

K = k th units
i = m th output
Theta (0) is NOT summed, i.e. i=1:S(l)

J(Θ)=−1m∑i=1m∑k=1K[y(i)klog((hΘ(x(i)))k)+(1−y(i)k)log(1−(hΘ(x(i)))k)]+λ2m∑l=1L−1∑i=1sl∑j=1sl+1(Θ(l)j,i)2

Backpropagation

Gradient: compute cost, partial derivative cost

delta(j,l): ERROR of unit/node j of layer l

For training example t =1 to m:

Set $a^{(1)} := x^{(t)}$
Perform forward propagation to compute $a^{(l)}$ for l=2,3,…,L

3. Using

y^{(t)}

, compute

\delta^{(L)} = a^{(L)} - y^{(t)}

4. Compute

\delta^{(L-1)}, \delta^{(L-2)},\dots,\delta^{(2)}

using

δ(l)=((Θ(l))Tδ(l+1)) .∗ a(l) .∗ (1−a(l))

The g-prime derivative terms can also be written out as:

g'(z^{(l)}) = a^{(l)}\ .*\ (1 - a^{(l)})

\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}

or with vectorization,

\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T

Hence we update our new

\Delta

matrix.

D(l)i,j:=1m(Δ(l)i,j+λΘ(l)i,j), if j≠0.
$D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}$ If j=0

Gradient Checking

gradApprox (numerical) ~~ DVec (back propagation)

Random initialization

zero initialization > identical hidden units (it doesn't work)
symmetry breaking

Practical neural network algorithm

pick a network architecture

# of output units = number of classes
# of hidden units per layer = usually more the better (must balance with cost of computation as it increases with more hidden units)
Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
# of input units = dimension of features x^{(i)}x (i)

Randomly initialize the weights
Implement forward propagation to get hΘ(x(i)) for any x^{(i)}x (i)
Implement the cost function
Implement backpropagation to compute partial derivatives
Use gradient checking to confirm that your backpropagation works. Then disable gradient checking
Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta. (keep in mind that J(Θ) is NOT convex and thus we can end up in a LOCAL minimum instead.)

Autonomous drive

Advice & Guideline

debug a learning algorithm > Machine learning diagnostic

overfitting
underfitting

Model selection (degree of polynomial), d

Train: 60%
Cross Validation: 20%
Test: 20%

Bias & Variance, J

High bias (underfitting): high J(train); high J(cv) ~ J(train)
High variance (overfitting): low J(train); J(cv) >> J(train)

Regularization, lambda

Learning curve, m

Fruitful Fixes

Getting more training examples: high variance
Trying smaller sets of features: high variance
Adding features: high bias
Adding polynomial features: high bias
Decreasing λ: high bias
Increasing λ: high variance

Design

Building spam classifier

Collect lots of data, high frequent occurring words (1000-5000)
Develop sophisticated features (using email header data in spam emails)
Develop algorithms to process your input in different ways (recognizing misspellings in spam).
Error analysis

quick & dirty first (simple algorithm, one day)
learning curve
manually examine the example of error made in the validation

Porter Stemming

Handling skewed data (rare class)

Precision: true positive / (true positive + false positive)
Recall: true positive / (true positive + true negative)
Trading off precision and recall: F score = 2 PR / (P+R)

Using large dataset

Support Vector Machine

SVM hypothesis

cost1(z): if y=1, z >= 1
cost0(z): if y=0, z <= -1
Optimize objective of logistic regression: A+lambda * B > C*A+B

Large margin classification

SVM decision boundary > margin
mathematics behind: ||u||, length of the vector, u
linear decision boundary
non-linear decision boundary

Kernel

compute similarity(Gaussian kernel): features based on land marks, l1, l2, l3
x ~ landmark 1, feature 1 = 1
choosing landmarks: X > f = similarity (x,l)
SVM with kernels: off-the -shelf software
C = (1/lambda)

large C > small lambda > high bias (underfitting)
small C > large lambda > high variance (overfitting)

Sigma square: threshold

SVM in practice

parameter theta: SVM package ( libSVM, liblinear, ...)
Need to specify

parameter C
kernel, similarity: Gaussian, Polynomial, String, Chi-square, Histogram intersection
do perform feature scaling first
satisfy "Mercer's Thereom", converge

multi-class classification
logistics regression vs. SVM

# features, n >> # example, m : logistics regression or SVM w/o kernel
small n, intermediate m : Gaussian kernel
small n, large m : logistics regression or SVM w/o kernel

NN: work well, but slow to train

WK8: Unsupervised Learning

Applications of Clustering

Market segment
social network analysis
organize datacenter cluster
astronomical data analysis

Principle Component Analysis (PCA)

find a direction/vector to minimize the projection error
reduce from n-dimension to k-dimension
NOT a linear regression

Reduce from n-dimension to k-dimension

Data preproccessing

mean normalization
scale features by n-2 or sigma

Covariance matrix, capital sigma

Sigma = (1/m)*X'*X
[u, s, v] = svd(Sigma): single value decomposition, n x n matrix
eig(sigma)

from [u, s, v] = svd(Sigma)

Ureduce = u[:,1:k]
z = Ureduce'*x
choosing k (# of principle components): s(1:k) / s(1:n) > 99%

Advice: apply PCA only to the training dataset (mapping Xtraining to z)

Application

Compression: reduce memory/disk to store data
Compression: speed up the algorithm
Visualization: K=2 or 3

Bad use o PCA

to prevent overfitting (reduce features): should use regularization instead
apply PCA, without trying original/raw data: check if storage shortage/slow performance issue

Anomaly Detection

Gaussian (Normal) distribution

mu, sigma

Algorithm evaluation (skewed class)

true positive, false positive, true negative, false negative
precision / recall
F- score

vs. supervised training

small number (0-20) of positive examples
different types of (future) anomalies

Choose Gaussian features

hist(x,bin)
log transform
exponent
Error analysis

Multivariate Gaussian Distribution

Large scale

Online learning

Stochastic gradient descent

randomly shuffle datasets
during learning, check cost(theta,x,y), before update theta using x, y
plot average cost over 1000(say) iterations
check not-so-noisy converge
use smaller alpha (typically constant) , if cost increasing

mini-batch gradient descent

batch, instead one example per iteration

Map-reduce & Data parallelism

split training set
parallel machines / multi-core machines
sums of the function over training set

Summary

Supervised Learning

Linear regression | Logistics regression | Neurel network | SVMs

Unsupervised Learning

K-mean | PCA | Anomaly detection

Special application

Recommender system | large scale ML

Advice on ML buildup

Machine Learning by Andrew Ng

Have a machine learn to do it by itself

Supervised Learning

Unsupervised Learning

Linear regression with one variable

Model and Cost function

Gradient Descent algorithm

Gradient Descent algorithm for linear regression

Linear regression as Matrix-Vector Multiplication

Multiple competing hypothesis as Matrix Matrix Multiplication

Multiple features

Normal Equation

Classification | Logistics regression

Problem solving

Regularization

Regularized Linear Regression

Regularized Logistic Regression

Forward propagation

Backpropagation

Gradient Checking

Random initialization

Practical neural network algorithm

Building spam classifier

Handling skewed data (rare class)

Support Vector Machine

SVM hypothesis

Large margin classification

Kernel

SVM in practice

Principle Component Analysis (PCA)

Anomaly Detection

Large scale

Online learning

Map-reduce & Data parallelism

Summary

Supervised Learning

Unsupervised Learning

Special application

Advice on ML buildup

Comments

Post a Comment

Popular Posts