Study of Deep Learning

Materials:

  • Video:

Knowledge

  • Softmax layer
    • used as the output layer
    • outputs a probability for each class
  • Forward evaluation: compute outputs from inputs, layer by layer
  • Backward propagation
    • Update the weights
    • E.g. gradient descent (see the sketch below)
  • Ground truth: the true labels the predictions are compared against
  • FFN: Feed-Forward Neural Network
  • Set initial weights
    • e.g. pre-train with an autoencoder
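A minimal NumPy sketch tying these pieces together: a forward pass through a small feed-forward network, a softmax output layer, and one gradient-descent update of the output weights. The layer sizes, data, and learning rate are made-up values for illustration.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability; rows sum to 1.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))            # 4 samples, 8 features (made-up data)
y = np.array([0, 2, 1, 2])             # ground-truth class indices

# Small random initial weights (an autoencoder could pre-train these instead).
W1, b1 = rng.normal(scale=0.1, size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 3)), np.zeros(3)

# Forward evaluation: input -> hidden (ReLU) -> softmax probabilities.
h = np.maximum(0.0, X @ W1 + b1)
p = softmax(h @ W2 + b2)               # one probability per class, per sample

# Backward propagation for the output layer (cross-entropy loss),
# followed by one gradient-descent weight update.
grad_logits = (p - np.eye(3)[y]) / len(y)
W2 -= 0.1 * (h.T @ grad_logits)        # learning rate 0.1 (arbitrary)
b2 -= 0.1 * grad_logits.sum(axis=0)
```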
  • Data representation
    • features
      • categorical features
        • no intrinsic ordering
        • require additional encoding, usually one-hot encoding (see the sketch below)
      • ordinal features
        • have an intrinsic ordering
    • Pre-process the dataset:
      • min-max normalization
        • used when the min and max are known
        • helps the network learn faster
        • prevents numerical errors
      • standardization
        • used when the min and max are not known
    • Overfitting
      • Avoid:
        • Dropout: only for deep learning
        • regularization
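A short sketch of the preprocessing steps above on toy features, with made-up values: one-hot encoding for a categorical feature, then min-max normalization and standardization for a numeric one.

```python
import numpy as np

# One-hot encoding for a categorical feature with no intrinsic ordering.
colors = np.array(["red", "green", "blue", "green"])      # hypothetical feature
categories = np.unique(colors)                            # ['blue', 'green', 'red']
one_hot = (colors[:, None] == categories).astype(float)   # one 1 per row

# Min-max normalization: rescale to [0, 1]; use when min and max are known.
x = np.array([3.0, 7.0, 11.0, 15.0])
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance; use when the range is unknown.
x_std = (x - x.mean()) / x.std()
```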
  • Network structure
    • Depth: #Hidden layers
      • together with the width, determines the number of parameters
    • Width: the dimension of each layer
      • usually < 1000; at most a few hundred neurons per hidden layer
    • Connectivity: how neurons are connected to one another
    • #Parameters: determined by the above three factors.
      • Too many parameters will overfit.
      • Sample-to-parameter ratio: usually between 5 and 30.
    • Shape: “tower” vs. “pyramid” shape
      • Usually a “pyramid” shape (layers narrow toward the output)
      • Deeper is better.
      • Thin-and-tall is better than fat-and-short (see the sketch below).
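A sketch of a pyramid-shaped feed-forward network with a parameter count and sample-to-parameter ratio; the layer widths, dataset size, and dropout rate are made-up values. Dropout (from the overfitting bullet above) is shown as an inverted-dropout mask on a hidden activation.

```python
import numpy as np

# Pyramid shape: wide near the input, narrowing toward the output.
layer_sizes = [128, 64, 32, 16, 10]    # input, three hidden layers, output (made up)

# Parameter count: weights plus biases for each layer-to-layer connection.
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))
n_samples = 100_000                    # hypothetical dataset size
print(n_params)                        # 11034 weights and biases
print(n_samples / n_params)            # sample/parameter ratio ~9, within 5 to 30

# Inverted dropout on a hidden activation during training (illustration).
dropout_rate = 0.5
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 64))           # hypothetical hidden-layer activations
mask = (rng.random(h.shape) >= dropout_rate) / (1.0 - dropout_rate)
h_train = h * mask                     # at test time, use h unchanged
```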
  • Activation function
    • Like a switch
    • Usually non-linear functions
    • E.g. (see the sketch below)
      • sigmoid, ranging from 0 to 1
        • in deep networks it suffers from the vanishing-gradient problem
        • used in recurrent networks (RNN, LSTM), not in feed-forward networks
      • ReLU, outputs max(0, x)
        • avoids the vanishing-gradient problem
        • most commonly used
        • used in feed-forward networks
      • tanh, ranging from -1 to 1
        • commonly used when features take negative values
        • often used in NLP
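A quick NumPy sketch of the three activation functions above, evaluated over a small range of inputs.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes to (0, 1); saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)          # 0 for negative inputs, identity otherwise

x = np.linspace(-4.0, 4.0, 9)
print(sigmoid(x))                      # all values in (0, 1)
print(relu(x))                         # all values >= 0
print(np.tanh(x))                      # all values in (-1, 1)
```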
  • Loss (or cost) function
    • cross-entropy
      • More suitable for predicting categorical labels
    • squared error
      • More suitable for predicting continuous values
    • Why?
      • Compare the loss surfaces of the different loss functions.
      • They differ in “steepness”: for categorical labels, cross-entropy stays steep when the prediction is far off, so gradient descent keeps making progress (see the sketch below).
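A small sketch comparing the two losses on one binary prediction, illustrating the “steepness” point; the predicted probabilities are made-up values.

```python
import numpy as np

def cross_entropy(p, y):
    # Binary cross-entropy for predicted probability p of the positive class.
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def squared_error(p, y):
    return (p - y) ** 2

y = 1.0                                  # ground-truth label
p = np.array([0.99, 0.5, 0.01])          # confident right, unsure, confident wrong
print(cross_entropy(p, y))               # ~[0.01, 0.69, 4.61]: steep when badly wrong
print(squared_error(p, y))               # ~[0.0001, 0.25, 0.98]: bounded and flatter
```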