Topics in Modern Machine Learning
Theme of this iteration
Theoretical foundations of deep learning: the course examines deep learning through the lens of classic and modern approximation, optimization, and learning theory.
Course info
Syllabus: PDF
Instructor: Professor Ju Sun
Email: jusun AT umn.edu
When/Where: Mon/Wed 2:30 – 3:45pm @ Amundson Hall 156
Office Hours: Mon/Wed 4 – 5pm @ Zoom or my office
Schedule
General readings: PNAS Colloquium on The Science of Deep Learning, Dec 2020. The course is organized around six thrusts.
Approximation
- Surveys
- Papers
- Visual proof of universality for shallow networks
- Approximation theory of the MLP model in neural networks (Survey of the universal approximation theorem (UAT) for shallow networks; an informal statement is sketched at the end of this section)
- Multilayer feedforward networks are universal approximators (UAT of both deep and shallow networks)
- Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations (ReLU networks with bounded width and unbounded depth)
- Approximating continuous functions by ReLU nets of minimal width (ReLU networks with bounded width and unbounded depth)
- Error bounds for approximations with deep ReLU networks (ReLU network approximation of smooth functions)
- Error bounds for approximations with deep ReLU neural networks in $W^{s, p}$ norms (ReLU network approximation of smooth functions)
- Provable approximation properties for deep neural networks (Approximation of functions on manifolds)
- The power of deeper networks for expressing natural functions (Exponential separation of deep and shallow networks in approximating polynomials)
- Benefits of depth in neural networks (Exponential separation of deep and shallow networks due to composition of ReLU activated layers)
- Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review (Separation of deep and shallow networks on compositional functions)
- High-dimensional data analysis: The blessings and curses of dimensionality (Donoho’s 2000 talk on curse/blessing of dimensionality in modern data analysis)
- Universal approximations of invariant maps by neural networks (Approximation of invariant and equivariant functions with respect to finite groups)
- Equivalence of approximation by convolutional neural networks and fully-connected networks (Approximation of equivariant functions using CNNs)
- Further readings
- The gap between theory and practice in function approximation with deep neural networks
- Proof of the Theory-to-Practice Gap in Deep Learning via Sampling Complexity Bounds for Neural Network Approximation Spaces
- A deep network construction that adapts to intrinsic dimensionality beyond the domain (Approximation of functions on manifolds with refined rates)
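For orientation, an informal sketch of the classical universal approximation theorem referenced above (not the precise statement of any single paper): for any continuous $f$ on a compact set $K \subset \mathbb{R}^d$ and any $\varepsilon > 0$, there is a one-hidden-layer network $\hat f(x) = \sum_{i=1}^{N} a_i\, \sigma(w_i^\top x + b_i)$, with $\sigma$ continuous and non-polynomial, such that $\sup_{x \in K} |f(x) - \hat f(x)| < \varepsilon$. The theorem is silent on how large $N$ must be; quantifying $N$, and the role of depth, is exactly what the rate and depth-separation papers above address.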
Optimization & Generalization
- Surveys
- Chaps 1–2, 5, 6 of The Modern Mathematics of Deep Learning
- Deep learning: a statistical viewpoint
- Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
- A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning
- Papers
- Recent Theoretical Advances in Non-Convex Optimization (computational complexity of algorithms for nonconvex problems)
- Optimization for deep learning: theory and algorithms (deep learning algorithms, tricks, and theories thereof)
- From Symmetry to Geometry: Tractable Nonconvex Problems (problems that are “essentially” convex up to symmetries)
- Accelerated methods for $\alpha$-weakly-quasi-convex problems (early results on optimizing star-convex functions)
- Primal-dual accelerated gradient methods with small-dimensional relaxation oracle (early results on optimizing star-convex functions)
- Near-Optimal Methods for Minimizing Star-Convex Functions and Beyond (state-of-the-art results on optimizing star-convex functions)
- An Alternative View: When Does SGD Escape Local Minima? (optimization of neural networks via star-convexity)
- SGD Converges to Global Minimum in Deep Learning via Star-convex Path (optimization of neural networks via star-convexity)
- Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition (state-of-the-art convergence results based on the PL condition; see the sketch at the end of this section)
- On the linearity of large non-linear models: when and why the tangent kernel is constant (explains when and why model linearization makes sense for nonlinear models)
- Loss landscapes and optimization in over-parameterized non-linear systems and neural networks (over-parameterized non-linear systems and neural networks satisfy the PL condition under certain assumptions)
- On Lazy Training in Differentiable Programming (when and why linearization makes sense, and when not—esp. in practice)
- Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path? (another proof that PL conditions imply global convergence and short optimization paths for underdetermined nonlinear models)
- Towards moderate overparameterization: global convergence guarantees for training shallow neural networks (specialization and improvement of the above result for shallow neural networks)
- {Euclidean, metric, and Wasserstein} gradient flows: an overview (a nice tutorial on gradient flows)
- Computational Optimal Transport (optimal transport problems, Wasserstein distances, and (Chap 9) Wasserstein gradient flow)
- Gradient Descent on Infinitely Wide Neural Networks: Global Convergence and Generalization (overview of global convergence and generalization of 2-layer NNs using mean-field analysis)
- On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport (the 1st technical paper behind the above review paper)
- Implicit Bias of Gradient Descent for Wide Two-layer Neural Networks Trained with the Logistic Loss (the 2nd technical paper behind the above review paper)
- A Mean Field View of the Landscape of Two-Layers Neural Networks (complementary mean-field analysis)
- Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit (complementary mean-field analysis)
- A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks (mean-field analysis of deep networks)
- Introduction to Statistical Learning Theory (quick review of classic statistical learning theory)
- Theory of Classification: a Survey of Some Recent Advances (quick review of more recent developments of statistical learning theory)
- Learning without Concentration (dealing with heavy-tailed data and functions in learning theory using one-sided concentration)
- Extending the scope of the small-ball method (dealing with heavy-tailed data and functions in learning theory using one-sided concentration)
- Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks (state-of-the-art VC dimension results for DNNs)
- Rademacher and Gaussian Complexities: Risk Bounds and Structural Results (introduction of Rademacher and Gaussian complexities; Rademacher complexity of 2-layer NNs)
- Spectrally-normalized margin bounds for neural networks (state-of-the-art Rademacher complexity estimates for DNNs)
- Size-Independent Sample Complexity of Neural Networks (another state-of-the-art Rademacher complexity estimate for DNNs)
- Understanding deep learning (still) requires rethinking generalization (distribution-independent bounds unlikely to explain the generalization of DL)
- To understand deep learning we need to understand kernel learning (distribution-dependent bounds may also struggle to explain the generalization of DL)
- Uniform convergence may be unable to explain generalization in deep learning (potential failure of two-sided uniform convergence results for DL)
- Failures of model-dependent generalization bounds for least-norm interpolation (a complementary (stronger) version of the above)
- PAC-Bayesian Learning of Linear Classifiers (PAC-Bayesian derandomization for linear predictors)
- A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks (PAC-Bayesian results for DNNs)
- Studying generalization in deep learning via PAC-Bayes (tutorial talk by Gintare Karolina Dziugaite)
- PAC-Bayesian Transportation Bound (Abstract view of the PAC-Bayesian framework and its connection to transportation bounds)
- Stability and Generalization (classic results connecting stability and generalization)
- Learnability, Stability and Uniform Convergence (classic results connecting stability and generalization)
- Sharper bounds for uniformly stable algorithms (state-of-the-art generalization bounds under uniform stability)
- Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses (algorithm stability of SGD on convex problems)
- Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent (refined notion of stability for SGD on nonconvex problems)
- High probability generalization bounds for uniformly stable algorithms with nearly optimal rate (high probability generalization bounds based on stability for convex problems)
- High-probability Convergence Bounds for Non-convex Stochastic Gradient Descent (high probability generalization bounds based on stability for nonconvex problems under certain conditions)
- The Implicit Bias of Gradient Descent on Separable Data (Algorithm bias of GD on several classical learning models)
- Boosting with early stopping: Convergence and consistency (Algorithm bias with coordinate descent on boosting models)
- Margins, Shrinkage, and Boosting (Algorithm bias with coordinate descent on boosting models)
- Implicit Regularization via Hadamard Product Over-Parametrization in High-Dimensional Linear Regression (Algorithm bias toward sparse solutions in linear regression)
- Implicit Regularization for Optimal Sparse Recovery (Algorithm bias toward sparse solutions in linear regression)
- Algorithmic Regularization in Over-parameterized Matrix Sensing and Neural Networks with Quadratic Activations (Algorithm bias in factorization)
- Implicit Regularization in Deep Matrix Factorization (Algorithm bias in linear DNNs)
- Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced (Algorithm bias in DNNs)
- Implicit Regularization in Deep Learning May Not Be Explainable by Norms (Difficulty in characterizing algorithm bias)
- On Margin Maximization in Linear and ReLU Networks (Difficulty in characterizing algorithm bias)
- Reconciling modern machine learning practice and the bias-variance trade-off (Initiates the study of double descent in modern machine learning)
- A brief prehistory of double descent (Double descent in early literature)
- Deep Double Descent: Where Bigger Models and More Data Hurt (Double descent on computer vision datasets)
- Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers (New explanation of the success of random forests and AdaBoost as interpolating predictors)
- Benign Overfitting in Linear Regression (Bias-variance characterization for linear regression with the minimum $\ell_2$-norm solution)
- Benign overfitting in ridge regression (Bias-variance characterization for linear regression with the minimum $\ell_2$-norm solution, as well as ridge regression)
- Just interpolate: Kernel “ridgeless” regression can generalize (Bias-variance characterization of the minimum $\ell_2$-norm solution for kernel regression)
- On the Multiple Descent of Minimum-Norm Interpolants and Restricted Lower Isometry of Kernels (Bias-variance characterization of the minimum $\ell_2$-norm solution for kernel regression)
- Surprises in High-Dimensional Ridgeless Least Squares Interpolation (Bias-variance characterization of the minimum $\ell_2$-norm solution for kernel regression)
- Linearized two-layers neural networks in high dimension (Bias-variance characterization of 1.5-layer NN in the linearization regime)
- When Do Neural Networks Outperform Kernel Methods? (Bias-variance characterization of 1.5-layer NN in the linearization regime)
- Generalization error of random features and kernel methods: hypercontractivity and kernel matrix concentration (Bias-variance characterization of 1.5-layer NN in the linearization regime)
- The generalization error of random features regression: Precise asymptotics and double descent curve (Bias-variance characterization of 1.5-layer NN in the linearization regime)
- The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training (Bias-variance characterization of 1.5-layer NN in the linearization regime)
- The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks (Bias-variance characterization of 1.5-layer NN in the linearization regime)
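As a common reference point for the PL-condition items above, here is an informal sketch (not quoted from any single paper): a differentiable $f$ with infimum $f^\star$ satisfies the Polyak-Łojasiewicz (PL) condition with constant $\mu > 0$ if $\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^\star)$ for all $x$. If $f$ is additionally $L$-smooth, gradient descent with step size $1/L$ converges linearly, $f(x_k) - f^\star \le (1 - \mu/L)^k\,(f(x_0) - f^\star)$, with no convexity assumption; several of the over-parameterization papers above argue that sufficiently wide networks satisfy (local versions of) this condition along the training trajectory.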
Invariance
- Surveys