Materials:
Knowledge
- Softmax layer
- the output layer
- outputs a probability for each class (see the sketch after this block)
- forward evaluation
- backward propagation
- Update weights
- E.g. Gradient descent
- ground truth
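A minimal NumPy sketch of these pieces (softmax output, forward evaluation, the softmax + cross-entropy gradient for backward propagation, and one gradient-descent weight update). The shapes, learning rate, and variable names are illustrative assumptions, not from the notes.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability, then normalize to probabilities.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Illustrative sizes: 4 input features, 3 classes, batch of 2 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))           # input batch
y = np.array([0, 2])                  # ground-truth class indices
W = rng.normal(size=(4, 3)) * 0.01    # small random initial weights
b = np.zeros(3)

# Forward evaluation: logits -> softmax -> one probability per class.
probs = softmax(X @ W + b)

# Backward propagation for softmax + cross-entropy:
# the gradient w.r.t. the logits is (probs - one_hot(y)) / batch_size.
one_hot = np.eye(3)[y]
grad_logits = (probs - one_hot) / X.shape[0]
grad_W = X.T @ grad_logits
grad_b = grad_logits.sum(axis=0)

# Update weights with plain gradient descent.
lr = 0.1
W -= lr * grad_W
b -= lr * grad_b
```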
- FFN: Feed Forward Neural Net
- Set initial weights
- Autoencoder
- Data representation
- features
- categorical features
- no intrinsic ordering
- require additional encoding, usually one-hot encoding (see the example after this block)
- ordinal features
- categorical features that do have an intrinsic ordering (e.g., small < medium < large)
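A small illustration of the two encodings; the color and size features are made-up examples.

```python
import numpy as np

# Categorical feature with no intrinsic ordering -> one-hot encoding.
colors = np.array(["red", "green", "blue", "green", "red"])
categories = np.unique(colors)                        # ['blue' 'green' 'red']
one_hot = (colors[:, None] == categories).astype(int)
print(one_hot)
# [[0 0 1]
#  [0 1 0]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]

# Ordinal feature -> the ordering can be kept as integers (ordering assumed here).
sizes = np.array(["small", "large", "medium"])
order = {"small": 0, "medium": 1, "large": 2}
encoded_sizes = np.array([order[s] for s in sizes])   # [0 2 1]
```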
- Pre-process dataset:
- min-max normalization
- use when the min and max are known (both options are sketched after this block)
- helps the network learn faster
- prevents numerical errors
- Standardization
- use when the min and max are not known
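A minimal sketch of both pre-processing options; the feature values are made up.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])   # one feature column (made-up values)

# Min-max normalization: rescale into [0, 1]; use when the min and max are known.
x_minmax = (x - x.min()) / (x.max() - x.min())   # [0.   0.25 0.5  1.  ]

# Standardization (z-score): zero mean, unit variance; no min/max needed.
x_std = (x - x.mean()) / x.std()
```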
- Overfitting
- Ways to avoid it:
- Dropout: only for deep learning (see the sketch after this block)
- regularization
- using fewer features
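Minimal NumPy sketches of both techniques; keep_prob and the regularization strength are illustrative values, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 8))           # hypothetical hidden-layer activations

# Inverted dropout: randomly zero some activations during training and rescale
# the survivors, so nothing needs to change at test time.
keep_prob = 0.8
mask = (rng.random(h.shape) < keep_prob) / keep_prob
h_train = h * mask                    # training time
h_test = h                            # test time: no dropout

# L2 regularization: add a weight penalty to the data loss.
W = rng.normal(size=(8, 3))
data_loss = 1.23                      # placeholder for the actual loss value
lam = 1e-3                            # regularization strength (assumed)
total_loss = data_loss + lam * np.sum(W ** 2)
```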
- Network structure
- Depth: #Hidden layers
- together with width and connectivity, depth determines the #parameters
- Width: the dimension of each layer
- usually < 1,000; at most a few hundred neurons per hidden layer
- Connectivity: how neurons are connected to each other
- #Parameters: determined by the above three factors.
- Too many parameters will cause overfitting.
- Sample/parameter ratio: usually between 5 and 30 (see the parameter-count sketch after this block)
- Shape: “tower” vs “pyramid” shape
- Usually “pyramid” shape
- Deeper is better.
- A thin, tall network is better than a fat, short one.
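One way to make the #parameters and sample/parameter points concrete: count the weights and biases of a fully-connected "pyramid"-shaped FFN. The layer sizes and sample count are assumptions for illustration.

```python
# Input layer, three hidden layers ("pyramid" shape), output layer.
layer_sizes = [100, 64, 32, 16, 3]

# Each fully-connected layer has n_in * n_out weights plus n_out biases.
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
print(n_params)               # 9123

# Rule of thumb from the notes: 5-30 samples per parameter,
# i.e. very roughly 46k-274k training samples for this network.
n_samples = 100_000
print(n_samples / n_params)   # ~11 samples per parameter
```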
- Activation function (see the code sketch after this block)
- Like a switch
- Usually non-linear functions
- E.g.
- sigmoid, ranging from 0 to 1
- causes vanishing gradients in deep networks
- Used in recurrent NNs (RNN, LSTM), not in feed-forward NNs
- ReLU: max(0, x), ranging from 0 to +∞
- avoids the vanishing gradient problem
- most commonly used
- used in feed-forward NNs
- tanh, ranging from -1 to 1
- commonly used when features take negative values
- common in NLP
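The three activations from the list, sketched in NumPy, with a quick look at why sigmoid gradients vanish in deep networks while ReLU's do not.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def relu(x):
    return np.maximum(0.0, x)         # output in [0, +inf)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))       # squashed into (0, 1)
print(relu(x))          # [0.  0.  0.  0.5 2. ]
print(np.tanh(x))       # squashed into (-1, 1)

# Vanishing gradient: sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)) <= 0.25,
# so the gradient shrinks every time it passes through a sigmoid layer.
# ReLU's gradient is 1 for positive inputs, which avoids this.
print(sigmoid(x) * (1 - sigmoid(x)))
```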
- Loss (or cost) function
- cross-entropy
- More suitable for predicting categorical labels
- squared error
- More suitable for predicting continuous values
- Why?
- Compare the loss surfaces of the two loss functions
- They differ in "steepness": with a sigmoid or softmax output, cross-entropy stays steep (large gradients) when the prediction is far from the label, while squared error flattens out, so cross-entropy learns faster (see the sketch after this block)
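A small numerical check of the "steepness" point: with a sigmoid output and a confidently wrong prediction, the cross-entropy gradient stays large while the squared-error gradient nearly vanishes. The numbers are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single output unit, label y = 1, but the logit z is far in the wrong direction.
y, z = 1.0, -4.0
p = sigmoid(z)                        # ~0.018, far from the label

# Squared error L = (p - y)^2:        dL/dz = 2 * (p - y) * p * (1 - p)
grad_se = 2 * (p - y) * p * (1 - p)   # ~ -0.035 (flat surface, slow learning)

# Cross-entropy L = -[y*log p + (1-y)*log(1-p)]:  dL/dz = p - y
grad_ce = p - y                       # ~ -0.982 (steep surface, big corrective step)

print(grad_se, grad_ce)
```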