Materials:

Knowledge

- Softmax layer
- the output layer
- output a probability for each class
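
As a minimal sketch of how a softmax output layer turns raw scores into one probability per class: the function name `softmax` and the example scores are illustrative assumptions, not taken from the notes.

```python
import math

def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtract the largest score first so exp() cannot overflow.
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three classes coming out of the last layer.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # class with the highest probability
```

The class with the largest raw score always receives the largest probability, so softmax changes the scale of the outputs but not their ranking.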

- forward evaluation
- backward propagation
- Update weights
- E.g. Gradient descent
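
The three steps above (forward evaluation, backward propagation, weight update) can be sketched on the smallest possible model, a single weight `w` fitted with plain gradient descent. The toy data, the learning rate, and the mean-squared-error loss are assumptions chosen for illustration.

```python
# Toy data generated by y = 3 * x; the ys play the role of the ground truth.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

w = 0.0               # initial weight
learning_rate = 0.01  # assumed step size

for step in range(200):
    # Forward evaluation: predictions with the current weight.
    preds = [w * x for x in xs]
    # Backward propagation: gradient of the mean squared error w.r.t. w.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Update weights: move against the gradient.
    w -= learning_rate * grad
```

With these settings `w` moves toward 3, the slope used to generate the data. A real network repeats the same loop with many weights and uses the chain rule to get each gradient.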

- ground truth
- FFN: Feed Forward Neural Net
- Set initial weights
- Auto encoder

- Data representation
- features
- categorical features
- no intrinsic ordering
- require additional encoding, usually one-hot encoding
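
A minimal sketch of one-hot encoding for a categorical feature. The helper `one_hot` and the color values are hypothetical; libraries such as scikit-learn provide their own encoders.

```python
def one_hot(values):
    """Map each categorical value to a 0/1 vector containing a single 1."""
    categories = sorted(set(values))  # fixed column order
    index = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot(["red", "green", "blue", "green"])
```

Each category gets its own column, so the encoding does not impose an artificial ordering on values that have none.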

- ordinal features

- Pre-process dataset:
- min-max normalization
- when the min and max are known
- Learn faster
- prevent numerical error

- Standardization
- when the min and max are not known
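
The two pre-processing options can be sketched as follows. The function names and the sample values (pixel intensities with a known 0 to 255 range, incomes with no known bounds) are assumptions for illustration.

```python
import math

def min_max_normalize(values, lo, hi):
    """Rescale values to [0, 1] given a known minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

# Known range: pixel intensities always lie between 0 and 255.
pixels = min_max_normalize([0, 64, 255], lo=0, hi=255)
# Unknown range: estimate mean and spread from the data itself.
incomes = standardize([30_000, 45_000, 60_000])
```

Both transforms keep the inputs on a small, comparable scale, which is what helps the network learn faster and reduces the risk of numerical error.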

- Overfitting
- Avoid:
- Dropout: applicable only to neural networks (deep learning)
- regularization
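
A rough sketch of both ways to avoid overfitting. It assumes the "inverted" variant of dropout and an L2 penalty for regularization; the notes do not say which variants are meant, and the names `dropout`, `l2_penalty` and `lam` are illustrative.

```python
import random

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each unit with probability drop_prob (training only)."""
    keep_prob = 1.0 - drop_prob
    # Survivors are scaled by 1/keep_prob so the expected activation is unchanged.
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def l2_penalty(weights, lam):
    """L2 regularization term added to the loss: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

rng = random.Random(0)  # seeded so the example is repeatable
hidden = [1.0] * 10
dropped = dropout(hidden, drop_prob=0.5, rng=rng)
penalty = l2_penalty([0.5, -1.0, 2.0], lam=0.01)
```

Dropout is switched off at prediction time, while the L2 penalty only affects training because it is part of the loss.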

- Network structure
- Depth: #Hidden layers
- for a fixed #parameters budget, the chosen width determines the depth

- Width: the dimension of each layer
- usually < 1000; at most a few hundred neurons per hidden layer

- Connectivity: how neurons are connected among each other
- #Parameters: determined by the above three factors.
- Too many parameters will overfit.
- “Sample/parameter” ratio: usually between 5 and 30.
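
To make the parameter count concrete, here is a small sketch for a fully connected feed-forward network. The layer sizes and the dataset size are hypothetical, and full connectivity between consecutive layers is an assumption.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per output unit
    return total

# Hypothetical network: 20 inputs, hidden layers of 16 and 8 units, 3 outputs.
layer_sizes = [20, 16, 8, 3]
n_params = count_parameters(layer_sizes)  # 336 + 136 + 27 = 499
n_samples = 5000                          # hypothetical dataset size
ratio = n_samples / n_params              # samples available per parameter
```

Here the ratio is about 10, inside the 5 to 30 range given above. Widening a layer raises the count quickly because each weight matrix grows with the product of two adjacent layer sizes.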

- Shape: “tower” vs “pyramid” shape
- Usually “pyramid” shape
- Deeper is better.
- Thin-tall is better than fat-short.

- Activation function
- Like a switch
- Usually non-linear functions
- E.g.
- sigmoid, ranging from 0 to 1
- In deep networks: suffers from vanishing gradients
- Used in recurrent NNs (RNN, LSTM); rarely in the hidden layers of feed-forward NNs

- ReLU, ranging from 0 to +∞ (outputs max(0, x))
- Mitigates vanishing gradient
- Most commonly used
- Used in feed-forward NN

- tanh, ranging from -1 to 1
- Commonly used when the features include negative values
- In NLP
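
A short sketch of why sigmoid suffers from vanishing gradients in deep networks while ReLU does not. It is a simplification that ignores the weights and only multiplies one activation derivative per layer; the depth of 10 is an arbitrary assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # exactly 1 for active units

# Backpropagation multiplies one such derivative per layer.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # best case for sigmoid: 0.25 ** 10
relu_chain = relu_grad(1.0) ** depth        # stays at 1.0 for active units
tanh_range = (math.tanh(-5.0), math.tanh(5.0))  # close to (-1, 1)
```

Even in its best case the sigmoid chain shrinks to roughly one millionth after ten layers, whereas the ReLU chain keeps its size as long as the units stay active.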

- Loss (or cost) function
- cross-entropy
- More suitable for predicting categorical labels

- squared error
- More suitable for predicting continuous values

- Why
- Compare the surface of different loss functions
- The difference of “steepness”
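
The difference in steepness can be checked numerically. This sketch assumes a single sigmoid output unit with pre-activation `z` and a binary label `y`; neither the setup nor the function names come from the notes.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_squared_error(z, y):
    """Derivative w.r.t. z of (sigmoid(z) - y) ** 2."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

def grad_cross_entropy(z, y):
    """Derivative w.r.t. z of -[y*log(p) + (1-y)*log(1-p)], with p = sigmoid(z)."""
    return sigmoid(z) - y

# True label is 1, but the network is confidently wrong (z = -5, so p is near 0).
z, y = -5.0, 1.0
g_se = grad_squared_error(z, y)
g_ce = grad_cross_entropy(z, y)
```

For this badly wrong prediction the cross-entropy gradient is close to -1, while the squared-error gradient is close to 0 because of the extra `p * (1 - p)` factor. The flatter surface means slower learning exactly where the error is largest, which is one reason cross-entropy suits categorical labels.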
