Review of Numerical Methods

Basics:

  •  Vectors:
    • plus, minus, scalar multiplication
    • magnitude and direction
    • inner product and angle
      • Cauchy-Schwarz inequality
    • parallel and orthogonal
    • projection
    • cross product
      • orthogonal to both input vectors
      • right-hand rule
      • magnitude equals the area of the parallelogram spanned by the two vectors, i.e. twice the area of the triangle they span
  • Linear equations
    • Lines in two-dimensions
      • Parameterization: a line is defined by a basepoint and a direction vector
      • Direction vector for Ax+By=k is [B, -A]. A normal vector is [A, B].
      • Two lines are parallel if their normal vectors are parallel vectors.
      • If two lines are not parallel, then they have a unique intersection.
      • If two lines are parallel, they either have no intersection at all or are the same line with infinitely many intersections.
      • Two parallel lines are equal <=> the vector connecting one point on each line is orthogonal to the lines’ normal vectors.
      • For two non-parallel lines Ax+By=k1 and Cx+Dy=k2, at most one of A and C is zero, and AD-BC!=0.
        • The intersection is x=(Dk1-Bk2)/(AD-BC); y=(-Ck1+Ak2)/(AD-BC) (see the intersection sketch after this list).
      • Using normal vectors generalizes better to higher dimensions.
    • Planes in three dimensions
      • Ax+By+Cz=k,
      • Normal vector: [A,B,C]
      • Two parallel planes are equal <=> the vector connecting one point on each plane is orthogonal to the planes’ normal vectors.
      • Given Ax+By+Cz=k1, Dx+Ey+Fz=k2, possible solutions sets are:
        • a line with direction vector [A,B,C] x [D,E,F], if planes are not parallel;
        • no solutions, if planes are parallel but not equal;
        • a plane, if the planes are the same.
      • Three or more planes can intersect in a single point.
      • We need at least two lines in two variables to obtain a unique intersection, and at least three planes in three variables.
      • Rules for manipulating equations
        • Should preserve the solution, should be reversible
        • swap order of equations
        • multiply an equation by a nonzero number
        • Add a multiple of an equation to another
      • A system is inconsistent <=> we find 0=k for some nonzero k during Gaussian elimination.
      • Counting the number of equations or looking for 0=0 rows is usually not enough to determine whether there are infinitely many solutions.
      • A consistent system has a unique solution <=> every variable is a pivot variable.
      • #Free variables = dimension of the solution set (see the rank/consistency sketch after this list)
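A minimal sketch of the two-line intersection formula above (Python; the function and parameter names such as intersect_2d, a, b, c, d, k1, k2 are illustrative, not from the notes):

```python
def intersect_2d(a, b, k1, c, d, k2, eps=1e-12):
    """Intersection of Ax + By = k1 and Cx + Dy = k2 via Cramer's rule."""
    det = a * d - b * c             # AD - BC
    if abs(det) < eps:              # normal vectors [A, B] and [C, D] are parallel
        return None                 # no unique intersection (parallel or same line)
    x = (d * k1 - b * k2) / det     # x = (D*k1 - B*k2) / (AD - BC)
    y = (-c * k1 + a * k2) / det    # y = (-C*k1 + A*k2) / (AD - BC)
    return (x, y)

# Example: x + 2y = 4 and 3x - y = 5 intersect at (2.0, 1.0).
print(intersect_2d(1, 2, 4, 3, -1, 5))
```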
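And a sketch of the rank/consistency checks, using matrix ranks as a stand-in for Gaussian elimination (an assumption: the notes do not name a library, NumPy is used here; classify_system is a made-up helper). For a consistent system, #free variables = #variables - #pivot variables = #variables - rank(A).

```python
import numpy as np

def classify_system(A, b, tol=1e-10):
    """Classify Ax = b as inconsistent, unique, or infinitely many solutions."""
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    rank_A = np.linalg.matrix_rank(A, tol)
    rank_Ab = np.linalg.matrix_rank(np.hstack([A, b]), tol)
    if rank_A < rank_Ab:                 # elimination would yield a row 0 = k, k != 0
        return "inconsistent"
    free_vars = A.shape[1] - rank_A      # non-pivot variables
    if free_vars == 0:                   # every variable is a pivot variable
        return "unique solution"
    return f"infinitely many solutions ({free_vars}-dimensional solution set)"

# Two parallel but distinct planes: no solution.
print(classify_system([[1, 2, 3], [2, 4, 6]], [1, 5]))
# Three planes meeting in a single point: unique solution.
print(classify_system([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [1, 2, 3]))
```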

Study of Deep Learning

Materials:

  • Video:

Knowledge

  • Softmax layer
    • typically the output layer of a classification network
    • outputs a probability for each class (see the softmax/cross-entropy sketch after this list)
  • forward evaluation
  • backward propagation
    • Update weights
    • E.g. Gradient descent
  • ground truth
  • FFN: Feed Forward Neural Net
  • Set initial weights
    • Autoencoder
  • Data representation
    •  features
      • categorical features
        • no intrinsic ordering
        • require additional encoding, usually one-hot encoding (see the pre-processing sketch after this list)
      • ordinal features
    • Pre-processing the dataset:
      • min-max normalization
        • use when the min and max are known
        • helps the network learn faster
        • helps prevent numerical error
      • Standardization
        • use when the min and max are unknown
    • Overfitting
      • Avoid:
        • Dropout: only for deep learning
        • regularization
  • Network structure
    • Depth: #Hidden layers
      • for a fixed width and parameter budget, the depth is determined
    • Width: the dimension of each layer
      • usually < 1000, at most a few hundred neurons per hidden layer
    • Connectivity: how neurons are connected among each other
    • #Parameters: determined by the above three factors.
      • Too many will overfit.
      • “Sample/parameter” ratio: usually between 5 and 30.
    • Shape: “tower” vs “pyramid” shape
      • Usually “pyramid” shape
      • Deeper networks generally perform better.
      • A thin-and-tall network is generally better than a fat-and-short one.
  • Activation function
    • Like a switch
    • Usually non-linear functions
    • E.g. (see the activation sketch after this list)
      • sigmoid, ranging from 0 to 1
        • suffers from vanishing gradients in deep networks
        • used in recurrent NNs (RNN, LSTM), not in feed-forward nets
      • ReLU, max(0, x), ranging from 0 to +∞
        • avoids vanishing gradients
        • most commonly used
        • used in feed-forward NNs
      • tanh, ranging from -1 to 1
        • commonly used when features take negative values
        • e.g. in NLP
  • Loss (or cost) function
    • cross-entropy
      • More suitable for predicting categorical labels
    • squared error
      • More suitable for predicting continuous values
    • Why it matters
      • compare the loss surfaces of different loss functions
      • they differ in “steepness”, which affects how fast gradient descent learns
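A sketch of the pre-processing steps above (Python + NumPy; the helper names and sample values are made up for illustration): one-hot encoding for a categorical feature, min-max normalization when the range is known, and standardization when it is not.

```python
import numpy as np

def one_hot(values):
    """One-hot encode a categorical feature (no intrinsic ordering)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return encoded

def min_max_normalize(x):
    """Scale to [0, 1]; use when the min and max are known."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Zero mean, unit variance; use when the min and max are unknown."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

print(one_hot(["red", "green", "red", "blue"]))
print(min_max_normalize([3.0, 5.0, 9.0]))
print(standardize([3.0, 5.0, 9.0]))
```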
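A sketch of the three activation functions listed above and their output ranges (NumPy; the sample inputs are illustrative).

```python
import numpy as np

def sigmoid(x):
    """Range (0, 1); gradients vanish for large |x|, a problem in deep networks."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """max(0, x), range [0, +inf); avoids vanishing gradients in feed-forward nets."""
    return np.maximum(0.0, x)

def tanh(x):
    """Range (-1, 1); useful when features take negative values."""
    return np.tanh(x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(relu(x))
print(tanh(x))
```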
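A minimal sketch of a softmax output layer and a cross-entropy loss against a one-hot ground-truth vector (NumPy; the scores and labels are made-up examples).

```python
import numpy as np

def softmax(scores):
    """Turn raw output-layer scores into one probability per class."""
    shifted = scores - np.max(scores)      # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(probs, one_hot_target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -np.sum(one_hot_target * np.log(probs + eps))

scores = np.array([2.0, 1.0, 0.1])         # raw scores for 3 classes
target = np.array([1.0, 0.0, 0.0])         # ground truth: class 0
probs = softmax(scores)
print(probs, cross_entropy(probs, target))
```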

“Sparse Tensor Factorization on Many-Core Processors with High-Bandwidth Memory” — Shaden Smith et al. 2017

Paper: TODO

Code: [Link]

Features:

  • maintains load balance with low synchronization overhead
  • explores architectural features, e.g. vectorization; synchronization primitives (mutexes, compare-and-swap, transactional memory, privatization); and management of high-bandwidth memory (MCDRAM)
  • Platform: one Intel Knights Landing (KNL) processor
  • Speedup: 1.8x over a dual-socket, 44-core Intel Xeon system

Findings:

Other Knowledge:

  • HPC systems are increasingly used for data intensive computations which exhibit irregular memory accesses, non-uniform work distributions, large memory footprints, and high memory bandwidth demands.
  • sparse, unstructured tensors
  • Challenges of optimizing algorithms for many-core processors: exposing a high degree of parallelism, load balancing tens to hundreds of parallel threads, and effectively utilizing the high-bandwidth memory.

Useful reference:

Dataset: