# Topics in Modern Machine Learning

### Theme of this iteration

**Theoretical foundations of deep learning**: this iteration examines deep learning through the lens of classic and modern approximation, optimization, and learning theory.

### Course info

**Syllabus**: PDF

**Instructor**: Professor Ju Sun. **Email**: jusun AT umn.edu

**When/Where**: Mon/Wed 2:30 – 3:45pm @ Amundson Hall 156

**Office Hours**: Mon/Wed 4 – 5pm @ Zoom or my office

### Schedule

General readings: PNAS Colloquium on The Science of Deep Learning, Dec 2020. The course is organized around six thrusts.

#### Approximation

- Surveys
- Papers
  - Visual proof of universality for shallow networks
  - Approximation theory of the MLP model in neural networks (survey of the UAT, focused on shallow networks)
  - Multilayer feedforward networks are universal approximators (UAT for both deep and shallow networks)
  - Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations (ReLU networks with bounded width and unbounded depth)
  - Approximating continuous functions by ReLU nets of minimal width (ReLU networks with bounded width and unbounded depth)
  - Error bounds for approximations with deep ReLU networks (ReLU-network approximation of smooth functions)
  - Error bounds for approximations with deep ReLU neural networks in $W^{s, p}$ norms (ReLU-network approximation of smooth functions)
  - Provable approximation properties for deep neural networks (approximation of functions on manifolds)
  - The power of deeper networks for expressing natural functions (exponential separation of deep and shallow networks in approximating polynomials)
  - Benefits of depth in neural networks (exponential separation of deep and shallow networks due to composition of ReLU-activated layers)
  - Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review (separation of deep and shallow networks on compositional functions)
  - High-dimensional data analysis: The blessings and curses of dimensionality (Donoho’s 2000 talk on the curse/blessing of dimensionality in modern data analysis)
  - Universal approximations of invariant maps by neural networks (approximation of invariant and equivariant functions w.r.t. finite groups)
  - Equivalence of approximation by convolutional neural networks and fully-connected networks (approximation of equivariant functions using CNNs)

- Further readings
  - The gap between theory and practice in function approximation with deep neural networks
  - Proof of the Theory-to-Practice Gap in Deep Learning via Sampling Complexity Bounds for Neural Network Approximation Spaces
  - A deep network construction that adapts to intrinsic dimensionality beyond the domain (approximation of functions on manifolds with refined rates)
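
The universality results above are easy to probe numerically. Below is a minimal sketch (the target function, knot grid, and unit count are illustrative choices, not taken from any paper above): a one-hidden-layer ReLU network with biases fixed on a grid reduces to linear least squares over ReLU features, so universality on an interval can be observed directly.

```python
import numpy as np

# Target to approximate on [0, 1] with a one-hidden-layer ReLU network
#     g(x) = c_0 + sum_k c_k * relu(x - b_k).
# With the biases b_k fixed on a grid, fitting the outer weights c is a
# linear least-squares problem over ReLU features.
def f(x):
    return np.sin(2 * np.pi * x)

n_hidden = 50
knots = np.linspace(0.0, 1.0, n_hidden)           # fixed biases b_k
x = np.linspace(0.0, 1.0, 1000)                   # dense evaluation grid

# Design matrix: one ReLU feature per knot, plus a constant column.
A = np.hstack([np.maximum(x[:, None] - knots[None, :], 0.0),
               np.ones((x.size, 1))])

coef, *_ = np.linalg.lstsq(A, f(x), rcond=None)   # outer-layer weights
max_err = np.max(np.abs(A @ coef - f(x)))
print(f"uniform error with {n_hidden} ReLU units: {max_err:.2e}")
```

The fit is piecewise linear, so the uniform error scales like the squared knot spacing; the depth-separation papers listed above study when deep networks provably beat such shallow rates.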

#### Optimization \& Generalization

- Surveys
  - Chaps 1–2, 5, 6 of The Modern Mathematics of Deep Learning
  - Deep learning: a statistical viewpoint
  - Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
  - A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

- Papers
  - Recent Theoretical Advances in Non-Convex Optimization (computational complexity of algorithms for nonconvex problems)
  - Optimization for deep learning: theory and algorithms (deep learning algorithms, tricks, and theories thereof)
  - From Symmetry to Geometry: Tractable Nonconvex Problems (problems that are “essentially” convex up to symmetries)
  - Accelerated methods for $\alpha$-weakly-quasi-convex problems (early results on optimizing star-convex functions)
  - Primal-dual accelerated gradient methods with small-dimensional relaxation oracle (early results on optimizing star-convex functions)
  - Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond (state-of-the-art results on optimizing star-convex functions)
  - An Alternative View: When Does SGD Escape Local Minima? (optimization of neural networks via star-convexity)
  - SGD Converges to Global Minimum in Deep Learning via Star-convex Path (optimization of neural networks via star-convexity)
  - Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition (state-of-the-art convergence results based on the PL condition)
  - On the linearity of large non-linear models: when and why the tangent kernel is constant (explains when and why model linearization makes sense for nonlinear models)
  - Loss landscapes and optimization in over-parameterized non-linear systems and neural networks (nonlinear systems and neural networks satisfy the PL condition under suitable assumptions)
  - On Lazy Training in Differentiable Programming (when and why linearization makes sense, and when not, especially in practice)
  - Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? (another proof that the PL condition implies global convergence and short optimization paths for underdetermined nonlinear models)
  - Towards moderate overparameterization: global convergence guarantees for training shallow neural networks (specialization and improvement of the above result for shallow neural networks)
  - {Euclidean, metric, and Wasserstein} gradient flows: an overview (a nice tutorial on gradient flows)
  - Computational Optimal Transport (optimal transport problems, Wasserstein distances, and (Chap 9) Wasserstein gradient flows)
  - Gradient Descent on Infinitely Wide Neural Networks: Global Convergence and Generalization (overview of global convergence and generalization of 2-layer NNs using mean-field analysis)
  - On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport (the 1st technical paper behind the above review paper)
  - Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss (the 2nd technical paper behind the above review paper)
  - A Mean Field View of the Landscape of Two-Layers Neural Networks (complementary mean-field analysis)
  - Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit (complementary mean-field analysis)
  - A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks (mean-field analysis of deep networks)
  - Introduction to Statistical Learning Theory (quick review of classic statistical learning theory)
  - Theory of Classification: a Survey of Some Recent Advances (quick review of more recent developments in statistical learning theory)
  - Learning without Concentration (dealing with heavy-tailed data and functions in learning theory using one-sided concentration)
  - Extending the scope of the small-ball method (dealing with heavy-tailed data and functions in learning theory using one-sided concentration)
  - Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks (state-of-the-art VC-dimension results for DNNs)
  - Rademacher and Gaussian Complexities: Risk Bounds and Structural Results (introduces Rademacher and Gaussian complexities; Rademacher complexity of 2-layer NNs)
  - Spectrally-normalized margin bounds for neural networks (state-of-the-art Rademacher complexity estimates for DNNs)
  - Size-Independent Sample Complexity of Neural Networks (another state-of-the-art Rademacher complexity estimate for DNNs)
  - Understanding deep learning (still) requires rethinking generalization (distribution-independent bounds are unlikely to explain the generalization of DL)
  - To understand deep learning we need to understand kernel learning (distribution-dependent bounds may also struggle to explain the generalization of DL)
  - Uniform convergence may be unable to explain generalization in deep learning (potential failure of two-sided uniform convergence results for DL)
  - Failures of model-dependent generalization bounds for least-norm interpolation (a complementary, stronger version of the above)
  - PAC-Bayesian Learning of Linear Classifiers (PAC-Bayesian derandomization for linear predictors)
  - A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (PAC-Bayesian results for DNNs)
  - Studying generalization in deep learning via PAC-Bayes (tutorial talk by Gintare Karolina Dziugaite)
  - PAC-Bayesian Transportation Bound (abstract view of PAC-Bayes and its connection to transportation bounds)
  - Stability and Generalization (classic results connecting stability and generalization)
  - Learnability, Stability and Uniform Convergence (classic results connecting stability and generalization)
  - Sharper bounds for uniformly stable algorithms (state-of-the-art generalization bounds under uniform stability)
  - Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses (algorithmic stability of SGD on convex problems)
  - Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent (refined notion of stability for SGD on nonconvex problems)
  - High probability generalization bounds for uniformly stable algorithms with nearly optimal rate (high-probability generalization bounds based on stability for convex problems)
  - High-probability Convergence Bounds for Non-convex Stochastic Gradient Descent (high-probability generalization bounds based on stability for nonconvex problems under certain conditions)
  - The Implicit Bias of Gradient Descent on Separable Data (algorithmic bias of GD on several classical learning models)
  - Boosting with early stopping: Convergence and consistency (algorithmic bias of coordinate descent on boosting models)
  - Margins, Shrinkage, and Boosting (algorithmic bias of coordinate descent on boosting models)
  - Implicit Regularization via Hadamard Product Over-Parametrization in High-Dimensional Linear Regression (algorithmic bias toward sparse solutions in linear regression)
  - Implicit Regularization for Optimal Sparse Recovery (algorithmic bias toward sparse solutions in linear regression)
  - Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations (algorithmic bias in factorization)
  - Implicit Regularization in Deep Matrix Factorization (algorithmic bias in linear DNNs)
  - Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced (algorithmic bias in DNNs)
  - Implicit Regularization in Deep Learning May Not Be Explainable by Norms (difficulty of characterizing algorithmic bias)
  - On Margin Maximization in Linear and ReLU Networks (difficulty of characterizing algorithmic bias)
  - Reconciling modern machine learning practice and the bias-variance trade-off (initiated the study of double descent in modern machine learning)
  - A brief prehistory of double descent (double descent in early literature)
  - Deep Double Descent: Where Bigger Models and More Data Hurt (double descent on computer vision datasets)
  - Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers (new explanation of the success of random forests and AdaBoost as interpolating predictors)
  - Benign Overfitting in Linear Regression (bias-variance characterization for linear regression with the least-L2-norm solution)
  - Benign overfitting in ridge regression (bias-variance characterization for linear regression with the least-L2-norm solution, as well as ridge regression)
  - Just interpolate: Kernel “ridgeless” regression can generalize (bias-variance characterization of the least-L2-norm solution for kernel regression)
  - On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels (bias-variance characterization of the least-L2-norm solution for kernel regression)
  - Surprises in High-Dimensional Ridgeless Least Squares Interpolation (bias-variance characterization of the least-L2-norm solution for kernel regression)
  - Linearized two-layers neural networks in high dimension (bias-variance characterization of 1.5-layer NNs in the linearization regime)
  - When Do Neural Networks Outperform Kernel Methods? (bias-variance characterization of 1.5-layer NNs in the linearization regime)
  - Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration (bias-variance characterization of 1.5-layer NNs in the linearization regime)
  - The generalization error of random features regression: Precise asymptotics and double descent curve (bias-variance characterization of 1.5-layer NNs in the linearization regime)
  - The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training (bias-variance characterization of 1.5-layer NNs in the linearization regime)
  - The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks (bias-variance characterization of 1.5-layer NNs in the linearization regime)
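
Several entries above hinge on the Polyak-Łojasiewicz (PL) inequality implying linear convergence of gradient descent without convexity or a unique minimizer. A minimal numerical check (using only numpy; the problem instance and sizes are illustrative, not from any paper above) on an underdetermined least-squares problem, which has a continuum of global minimizers yet satisfies PL with $\mu = \sigma_{\min}(A)^2$:

```python
import numpy as np

# f(w) = 0.5 * ||A w - b||^2 with a fat A (more unknowns than equations) is
# convex but NOT strongly convex: it has a continuum of global minimizers,
# all with f* = 0.  It still satisfies the PL inequality
#     0.5 * ||grad f(w)||^2 >= mu * (f(w) - f*),   mu = sigma_min(A)^2 > 0,
# because ||A^T r||^2 >= sigma_min(A)^2 * ||r||^2 when A has full row rank.
rng = np.random.default_rng(0)
m, n = 20, 50                          # underdetermined: n > m
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

s = np.linalg.svd(A, compute_uv=False)
L, mu = s[0] ** 2, s[-1] ** 2          # smoothness and PL constants

w = np.zeros(n)
losses = []
for _ in range(200):
    r = A @ w - b
    losses.append(0.5 * r @ r)
    w -= (1.0 / L) * (A.T @ r)         # gradient descent with stepsize 1/L

# PL theory predicts f(w_t) <= (1 - mu/L)^t * f(w_0): linear convergence.
bound = (1 - mu / L) ** (len(losses) - 1) * losses[0]
print(f"final loss {losses[-1]:.3e} vs PL bound {bound:.3e}")
```

This is the mechanism behind the global convergence results for overparameterized models listed above: overparameterization keeps the training dynamics in a region where the PL constant stays bounded away from zero.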

#### Invariance

- Surveys