Machine Learning by Andrew Ng
Source: Coursera
Introduction | Linear regression with one variable | Linear Algebra
Linear regression with multiple variables | Octave/Matlab
Logistic regression | Regularization
Neural networks | Model | Forward propagation
Neural networks | Cost function | Back propagation
ML | Advice | Fix | Design
Unsupervised | K-Means | PCA
Anomaly Detection | Gaussian dist.
Learning with Large Datasets
Have a machine learn to do it by itself (without being explicitly programmed). Why machine learning now:
- big data mining: web clicks, biology
- applications that cannot be programmed by hand: autopilot, Natural Language Processing, Computer Vision
- self-customizing programs: recommendations
- understanding human learning
ML definition
- Task, T
- Experience, E
- Performance, P
a program learns if its performance on task T, as measured by P, improves with experience E
practical know-how for applying learning algorithms
Supervised Learning
- RIGHT answer given (i.e. a labeled training set)
- regression: continuous output
- classification: discrete output
Unsupervised Learning
- no labels > find structure/patterns in the data, e.g. Google News, genes, computing clusters, social network analysis, market segmentation
- Clustering
- Non-clustering
Linear regression with one variable
Model and Cost function
- Model: Training set > Learning algorithm > Hypothesis h, a predictor for the corresponding value of y given an x
- Cost function (squared error function) for regression: measure the difference between h(x) on the inputs x and the actual outputs y, and minimize it
- Goal: minimize cost function J(theta) of parameter theta
- Contour plot/figure > Gradient Descent: linear regression with one variable
\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x
J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\big(\hat{y}^{(i)} - y^{(i)}\big)^2 = \frac{1}{2m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2
Gradient Descent algorithm
(repeat until convergence, with simultaneous update of each \theta_j)
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
- a := b assignment
- a = b true assertion
- alpha: learning rate
- too small: slow to descend
- too large: may overshoot the minimum and fail to converge
- derivative term (e.g. the slope; at a local optimum the slope is 0, so theta stops changing)
Gradient Descent algorithm for linear regression
(simultaneous update of \theta_0 and \theta_1) repeat until convergence: {
\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)
\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\big((h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}\big)
}
- convex function: bow shape, only one global optima
- batch: use all training set/example
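As a minimal Octave sketch of the batch update above (variable names X, y, theta, alpha, num_iters are assumptions; X carries a leading column of ones):
% vectorized batch gradient descent for linear regression
% X: m x 2 with a leading column of ones, y: m x 1, theta: 2 x 1
m = length(y);
for iter = 1:num_iters
  h = X * theta;                                  % predictions for all m examples
  theta = theta - (alpha / m) * (X' * (h - y));   % simultaneous update of theta_0 and theta_1
end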
Linear regression as Matrix-Vector Multiplication
Predictor (m,1) = Data matrix (m,n) * parameter vector (n,1)
Multiple competing hypothesis as Matrix Matrix Multiplication
Predictor (m,o) = Data matrix (m,n) * Parameter matrix (n,o)
- parallelism
- SIMD stands for "Single Instruction, Multiple Data"
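A small Octave illustration of both products (the numbers are made up for illustration):
X = [1 2104; 1 1416; 1 1534; 1 852];   % data matrix (m x n), first column of ones
theta = [-40; 0.25];                   % one hypothesis: h(x) = -40 + 0.25*x
predictions = X * theta;               % predictor (m x 1) = data matrix * parameter vector

Theta = [-40 200 -150; 0.25 0.1 0.4];  % parameter matrix (n x o): three competing hypotheses
all_predictions = X * Theta;           % (m x o): each column = predictions of one hypothesis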
Identity matrix, I, I nxn
Inverse matrix
Singular / degenerate matrix
Transpose
Multiple features
- multivariate linear regression
h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \dots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \\ x_1 \\ \vdots \\ x_n\end{bmatrix} = \theta^T x
- Gradient Descent for Multiple Variables
repeat until convergence: {
\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)\cdot x_j^{(i)} \quad \text{for } j := 0 \dots n
}
- Feature Scaling: get every feature into a similar range for a less skewed contour of the cost function, so gradient descent converges faster (see the Octave sketch after this list)
- e.g. roughly -1 to 1
- mean normalization: subtract the mean, then divide by the range or the standard deviation
- Learning rate:
- plot cost function J(theta) vs. # of iterations
- too small: slow convergence
- too large: J(theta) may not decrease on every iteration and may not converge
- try values of alpha roughly 3x apart (e.g. 0.001, 0.003, 0.01, 0.03, ...)
- Polynomial Regression
- can be treated as multivariate linear regression by defining new features (e.g. x, x^2, x^3)
- feature scaling matters for the new polynomial features
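Mean normalization and feature scaling as an Octave sketch (X here is the raw feature matrix without the bias column; the names are illustrative):
mu = mean(X);                                              % 1 x n vector of feature means
sigma = std(X);                                            % 1 x n vector of standard deviations
X_norm = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);   % each feature ~ mean 0, comparable scale
% apply the same mu and sigma to new examples before predicting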
Normal Equation
- feature scaling is not necessary
- no need to choose learning rate, alpha
- don't need to iterate
- need to compute the inverse of X^T X (pinv): cost grows roughly as n^3
- slow when the # of features n is large, e.g. n >= 10000
- Noninvertibility
- X^T X may be non-invertible (singular)
- Octave: pinv (pseudo-inverse) still does the right thing, unlike inv
- redundant features (linearly dependent)
- too many features (e.g. m <= n)
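The normal equation in Octave (X includes the column of ones; pinv is used so a singular X'X still works):
theta = pinv(X' * X) * X' * y;   % closed-form solution, no alpha and no iterations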
Prototyping language
- Octave
- MATLAB
- Python
- NumPy
- R
A*B does a matrix multiply
A.*B does an element-wise multiplication
"h(x) = theta' * x" (one example) vs. "h(x) = X * theta" (all examples at once)
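A small Octave illustration of the two kinds of product (made-up values):
A = [1 2; 3 4];  B = [10 20; 30 40];
A * B    % matrix multiply:       [ 70 100; 150 220]
A .* B   % element-wise multiply: [ 10  40;  90 160]
% hypothesis for one example x (column vector): h = theta' * x
% hypothesis for all examples at once:          h = X * theta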
Classification | Logistic regression
binary classification problem
Logistic regression model
- Hypothesis: Sigmoid function / Logistic function, g(z)
- Decision boundary: a property of the hypothesis and its parameters, not of the training set
- Cost function: convex
h = g(X\theta)
J(\theta) = \frac{1}{m}\cdot\big(-y^T \log(h) - (1-y)^T \log(1-h)\big)
- Gradient Descent
\theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - y)
- Advanced optimization: Conjugate gradient, BFGS, L-BFGS (faster, no need to pick alpha manually, but more complex); see the costFunction sketch after this list
- function [jVal, gradient] = costFunction(theta)
- [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
- Multiclass Classification: one-vs-all (train one classifier per class, pick the class with the highest h)
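A sketch of the costFunction / fminunc pattern for (unregularized) logistic regression; the extra X, y arguments and the anonymous-function wrapper are implementation choices, not the exact code from the course:
function [jVal, gradient] = costFunction(theta, X, y)
  % logistic regression cost and gradient, vectorized
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                         % sigmoid hypothesis
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));  % J(theta)
  gradient = (1/m) * X' * (h - y);                        % dJ/dtheta
end

options = optimset('GradObj', 'on', 'MaxIter', 400);
initialTheta = zeros(size(X, 2), 1);
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);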
Problem solving
- under-fit: high bias
- over-fit: high variance, fails to generalize to new examples
- Option 1: Reduce the number of features
- manually select which features to keep
- model selection algorithm
- Option 2: Regularization
- reduce the magnitude/values of the parameters theta while keeping all the features
Regularization
reduce over-fitting
- cost function: add a regularization term over the parameters theta (by convention theta_0 is not penalized)
- small values for the parameters theta give a simpler hypothesis, less prone to over-fitting
- lambda too large: the algorithm results in under-fitting (fails to fit even the training examples well); it is not true that it "works fine", that it merely fails to eliminate over-fitting, or that gradient descent fails to converge
- Adding a new feature always gives equal or better performance on the training set, but NOT necessarily on data outside the training set
- Adding many new features to the model makes it more likely to overfit the training set.
Regularized Linear Regression
- Gradient Descent
- Normal Equation
Regularized Logistic Regression
- Cost Function
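A sketch of the regularized logistic regression cost and gradient in Octave (theta(1) corresponds to theta_0 and is not regularized; lambda is assumed given):
function [J, grad] = costFunctionReg(theta, X, y, lambda)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));
  reg = (lambda / (2 * m)) * sum(theta(2:end) .^ 2);           % skip theta_0
  J = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) + reg;
  grad = (1/m) * X' * (h - y);
  grad(2:end) = grad(2:end) + (lambda / m) * theta(2:end);     % regularize all but theta_0
end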
Neural network
non-linear hypothesis
the "one learning algorithm" hypothesis: sensor > cortex, data > algorithm
a neuron in the brain fires spikes, pulses of electricity
- Input wires: Dendrites > x (the features)
- Nucleus (cell body) > activation function; Neural network = Input, Hidden, Output layers
- a_i^{(j)}: activation of unit i in layer j
- Output wire: Axon > h(x); the parameters are the weights
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix} \rightarrow \begin{bmatrix} a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \end{bmatrix} \rightarrow h_\theta(x)
z^{(j)} = \Theta^{(j-1)} a^{(j-1)}
multiply the matrix \Theta^{(j-1)} (dimensions s_j \times (n+1)) by the vector a^{(j-1)} (height n+1)
a^{(j)} = g(z^{(j)})
h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression.
Forward propagation
- vectorization
- the network architecture lets the hidden layers learn their own features
Examples
- AND, OR, NOT, NOR: each unit behaves like logistic regression (see the AND sketch after this list)
- Multiclass classification: multiple output units, one-vs-all
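The AND example as a tiny forward pass in Octave, using the weights Θ = [-30 20 20] from the lecture:
g = @(z) 1 ./ (1 + exp(-z));          % sigmoid activation
Theta = [-30 20 20];                  % weights for bias, x1, x2
X = [1 0 0; 1 0 1; 1 1 0; 1 1 1];     % each row: [bias x1 x2]
h = g(X * Theta')                     % approx [0; 0; 0; 1] -> computes x1 AND x2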
Neural network (Classification)
- L: # of layers
- s_l: # of units in layer l (not counting the bias unit)
- Binary classification: s_L = 1
- Multi-class classification: s_L = K (K >= 3)
- Cost function: a generalization of the logistic regression cost
- inner sum runs over the K output units
- outer sum runs over the m training examples
- the bias weights \Theta^{(l)}_{j,0} are NOT included in the regularization sum (i runs from 1 to s_l)
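Written out, the cost function is the logistic regression cost summed over the K output units, plus a regularization term that skips the bias weights:
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)}\log\big((h_\Theta(x^{(i)}))_k\big) + (1-y_k^{(i)})\log\big(1-(h_\Theta(x^{(i)}))_k\big) \Big] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\big(\Theta_{j,i}^{(l)}\big)^2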
Backpropagation
Goal: compute the cost J(\Theta) and the partial derivatives \frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta)
\delta^{(l)}_j: "error" of unit/node j in layer l
Given the training set \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}, set \Delta^{(l)}_{i,j} := 0 for all l, i, j.
For training example t = 1 to m:
1. Set a^{(1)} := x^{(t)}
2. Perform forward propagation to compute a^{(l)} for l = 2, 3, \dots, L
3. Using y^{(t)}, compute \delta^{(L)} = a^{(L)} - y^{(t)}
4. Compute \delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)} using \delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)}) .* a^{(l)} .* (1 - a^{(l)})
The g-prime derivative term can also be written as g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})
5. \Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a^{(l)}_j \delta^{(l+1)}_i, or with vectorization, \Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T
Hence we update our new \Delta matrix:
D^{(l)}_{i,j} := \frac{1}{m}\big(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\big), if j \neq 0
D^{(l)}_{i,j} := \frac{1}{m}\Delta^{(l)}_{i,j}, if j = 0
D is the gradient: \frac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta) = D^{(l)}_{i,j}
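The same computation for a 3-layer network (input, one hidden, output) as a vectorized Octave sketch; X, y_matrix (m x K one-hot labels), Theta1, Theta2 and lambda are assumed to exist:
sigmoid = @(z) 1 ./ (1 + exp(-z));
m  = size(X, 1);

% forward propagation over all m examples
a1 = [ones(m, 1) X];                 % m x (n+1)
z2 = a1 * Theta1';                   % m x s2
a2 = [ones(m, 1) sigmoid(z2)];       % m x (s2+1)
a3 = sigmoid(a2 * Theta2');          % h_Theta(x), m x K

% backpropagation of the "errors"
d3 = a3 - y_matrix;                                                % delta at the output layer
d2 = (d3 * Theta2(:, 2:end)) .* sigmoid(z2) .* (1 - sigmoid(z2));  % delta at the hidden layer

% accumulate, average, and regularize (the bias column j = 0 is not regularized)
Theta1_grad = (d2' * a1) / m;
Theta2_grad = (d3' * a2) / m;
Theta1_grad(:, 2:end) = Theta1_grad(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
Theta2_grad(:, 2:end) = Theta2_grad(:, 2:end) + (lambda / m) * Theta2(:, 2:end);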
Gradient Checking
- check that gradApprox (numerical estimate) ≈ DVec (gradient from backpropagation), then turn checking off
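The two-sided numerical estimate in Octave (thetaVec is the unrolled parameter vector and J(.) the cost function; both are assumed helpers):
EPSILON = 1e-4;
n = length(thetaVec);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus  = thetaVec;  thetaPlus(i)  = thetaPlus(i)  + EPSILON;
  thetaMinus = thetaVec;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
% compare gradApprox with DVec, then turn gradient checking OFF before training (it is very slow)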
Random initialization
- zero initialization > all hidden units compute identical functions (it doesn't work)
- symmetry breaking
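Symmetry breaking in Octave; INIT_EPSILON and the layer sizes (n input units, s2 hidden units, K outputs) are illustrative:
INIT_EPSILON = 0.12;                                          % small interval around zero
Theta1 = rand(s2, n + 1) * 2 * INIT_EPSILON - INIT_EPSILON;   % uniform in [-eps, eps]
Theta2 = rand(K, s2 + 1) * 2 * INIT_EPSILON - INIT_EPSILON;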
Practical neural network algorithm
- pick a network architecture
- # of output units = number of classes
- # of hidden units per layer = usually the more the better (balance against the computational cost, which grows with more hidden units)
- Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
- # of input units = dimension of the features x^{(i)}
- Randomly initialize the weights
- Implement forward propagation to get h_\Theta(x^{(i)}) for any x^{(i)}
- Implement the cost function
- Implement backpropagation to compute partial derivatives
- Use gradient checking to confirm that your backpropagation works. Then disable gradient checking
- Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta. (keep in mind that J(Θ) is NOT convex and thus we can end up in a LOCAL minimum instead.)
Autonomous driving
Advice & Guideline
debug a learning algorithm > machine learning diagnostics
- overfitting
- underfitting
- Train: 60%
- Cross Validation: 20%
- Test: 20%
- High bias (underfitting): high J(train); high J(cv) ~ J(train)
- High variance (overfitting): low J(train); J(cv) >> J(train)
Learning curves: plot J(train) and J(cv) against the training set size m
Fruitful Fixes
- Getting more training examples: high variance
- Trying smaller sets of features: high variance
- Adding features: high bias
- Adding polynomial features: high bias
- Decreasing λ: high bias
- Increasing λ: high variance
Design
Building spam classifier
- Collect lots of data; features: the most frequently occurring words (1000-5000)
- Develop sophisticated features (using email header data in spam emails)
- Develop algorithms to process your input in different ways (recognizing misspellings in spam).
- Error analysis
- quick & dirty first (simple algorithm, one day)
- learning curve
- manually examine the examples the algorithm got wrong in the cross-validation set
- Porter Stemming
Handling skewed data (rare class)
- Precision: true positive / (true positive + false positive)
- Recall: true positives / (true positives + false negatives)
- Trading off precision and recall: F score = 2 PR / (P+R)
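Computing the metrics from 0/1 vectors predictions and y in Octave (assumed names; y = 1 is the rare class):
tp = sum((predictions == 1) & (y == 1));    % true positives
fp = sum((predictions == 1) & (y == 0));    % false positives
fn = sum((predictions == 0) & (y == 1));    % false negatives
precision = tp / (tp + fp);
recall    = tp / (tp + fn);
F1 = 2 * precision * recall / (precision + recall);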
Support Vector Machine
SVM hypothesis
- cost1(z): if y=1, z >= 1
- cost0(z): if y=0, z <= -1
- compared with the logistic regression objective A + lambda*B, the SVM optimizes C*A + B
Large margin classification
- SVM decision boundary > margin
- mathematics behind: ||u||, length of the vector, u
- linear decision boundary
- non-linear decision boundary
Kernel
- compute similarity (Gaussian kernel): features based on landmarks l^(1), l^(2), l^(3)
- if x is close to landmark 1, then feature f_1 ≈ 1 (≈ 0 when far away)
- choosing landmarks: put one landmark at each training example, so f_i = similarity(x, l^(i))
- SVM with kernels: use off-the-shelf software
- C plays the role of 1/lambda
- large C > small lambda > lower bias, higher variance (prone to overfitting)
- small C > large lambda > higher bias, lower variance (prone to underfitting)
- sigma^2 (Gaussian kernel width): large sigma^2 > features vary more smoothly > higher bias, lower variance; small sigma^2 > the opposite
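The Gaussian kernel similarity as a small Octave function (x and the landmark l are column vectors):
function sim = gaussianKernel(x, l, sigma)
  % close to 1 when x is near the landmark, close to 0 when far away
  sim = exp(-sum((x - l) .^ 2) / (2 * sigma ^ 2));
end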
SVM in practice
- parameters theta: obtained from an SVM software package (libSVM, liblinear, ...)
- Need to specify
- parameter C
- kernel, similarity: Gaussian, Polynomial, String, Chi-square, Histogram intersection
- do perform feature scaling first
- the kernel must satisfy "Mercer's Theorem" so the optimization converges
- multi-class classification
- logistic regression vs. SVM
- # features n >> # examples m: logistic regression or SVM without a kernel
- small n, intermediate m: SVM with a Gaussian kernel
- small n, large m: create/add more features, then logistic regression or SVM without a kernel
- NN: work well, but slow to train
WK8: Unsupervised Learning
Applications of Clustering
- market segmentation
- social network analysis
- organizing computing clusters / data centers
- astronomical data analysis
Principal Component Analysis (PCA)
- find a direction/vector to minimize the projection error
- reduce from n-dimension to k-dimension
- NOT a linear regression
- Data preprocessing
- mean normalization
- scale features to a comparable range, e.g. divide by the range (max - min) or by sigma
- Covariance matrix, capital Sigma
- Sigma = (1/m)*X'*X
- [U, S, V] = svd(Sigma): singular value decomposition, Sigma is an n x n matrix
- eig(Sigma) also works
- from [U, S, V] = svd(Sigma)
- Ureduce = U(:, 1:k)
- z = Ureduce' * x
- choosing k (# of principal components): retain 99% of the variance, i.e. the sum of the first k diagonal entries of S divided by the sum of all n of them >= 0.99
Advice: fit PCA (the mapping from x to z) only on the training set, then apply the same mapping to the cross-validation and test sets
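The PCA pipeline above as an Octave sketch (X is the mean-normalized m x n training matrix; k is assumed chosen):
Sigma = (1 / m) * (X' * X);          % covariance matrix, n x n
[U, S, V] = svd(Sigma);              % singular value decomposition
Ureduce = U(:, 1:k);                 % first k principal components
Z = X * Ureduce;                     % projected data, m x k
X_approx = Z * Ureduce';             % approximate reconstruction of X
s = diag(S);
variance_retained = sum(s(1:k)) / sum(s);   % choose the smallest k giving >= 0.99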
Application
- Compression: reduce memory/disk to store data
- Compression: speed up the algorithm
- Visualization: K=2 or 3
Bad uses of PCA
- to prevent overfitting (by reducing features): use regularization instead
- applying PCA by default without first trying the original/raw data: add PCA only if there is a genuine storage or speed problem
Anomaly Detection
Gaussian (Normal) distribution: parameters mu, sigma^2
- true positive, false positive, true negative, false negative
- precision / recall
- F- score
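Fitting the per-feature Gaussians and scoring examples, as an Octave sketch (X is m x n; the threshold epsilon would be chosen on a cross-validation set using the F-score):
mu = mean(X);                        % 1 x n feature means
sigma2 = var(X, 1);                  % 1 x n variances (normalized by m)
% product of the per-feature Gaussian densities (uses Octave broadcasting)
p = prod(exp(-(X - mu) .^ 2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2), 2);
anomalies = find(p < epsilon);       % flag low-probability examples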
vs. supervised learning: prefer anomaly detection when
- there is only a small number (0-20) of positive examples
- there are many different types of (future) anomalies that no algorithm could learn from so few positive examples
Multivariate Gaussian Distribution: also captures correlations between features
Choose Gaussian features
- hist(x,bin)
- log transform
- exponent
- Error analysis
Large scale
Online learning
Stochastic gradient descent
- randomly shuffle the dataset
- during learning, compute cost(theta, (x^(i), y^(i))) just before updating theta with that example
- plot the cost averaged over the last 1000 (say) examples
- check that the (noisy) curve is converging
- use a smaller alpha (typically kept constant) if the cost is increasing
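Stochastic gradient descent for linear regression as a sketch (X carries the bias column; alpha and num_passes are illustrative):
perm = randperm(m);                  % randomly shuffle the dataset
X = X(perm, :);  y = y(perm);
for pass = 1:num_passes              % typically 1-10 passes over the data
  for i = 1:m
    h = X(i, :) * theta;                              % prediction for example i only
    theta = theta - alpha * (h - y(i)) * X(i, :)';    % update using that single example
  end
end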
mini-batch gradient descent
- use b examples per iteration, instead of one (stochastic) or the whole training set (batch)
Map-reduce & Data parallelism
- split the training set across machines
- parallel machines / multi-core machines
- each node computes a partial sum over its subset of the training set; a master combines the sums