List of talks in the seminar series
“MATHEMATICAL METHODS AND MODELS IN MACHINE LEARNING”

Artificial Intelligence is profoundly and quickly changing the technological profile of our society, and yet machine learning, its disruptive spearhead, although filled with brilliant heuristic solutions, has almost no theoretical basis from a strictly scientific point of view. The gap between the increasing performance of deep learning and our understanding of it is therefore a highly relevant scientific theme, one that calls for strong participation in research efforts from the fields of Mathematics and Mathematical Physics. The identification of the correct mathematical models and their analysis is of fundamental importance toward the discovery of a theory that would make it possible to exploit the full potential of this technology and to understand its limits. The purpose of the seminar series is to present recent results on mathematical methods and models related to machine learning and to connect researchers coming from different areas.
In this talk I will present a new generative model of structured data, the hidden manifold model, and analyse learning dynamics and phase diagrams for such structured data.
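The construction behind the hidden manifold model can be sketched numerically: inputs are generated by pushing low-dimensional latent Gaussian vectors through a fixed random feature map, while labels depend only on the latent coordinates. This is a minimal sketch under illustrative assumptions; the dimensions, the tanh feature map, and the linear teacher below are not the model's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_latent = 1000, 200, 10            # samples, ambient dim, latent dim

Z = rng.standard_normal((n, d_latent))    # latent coordinates on the manifold
F = rng.standard_normal((d_latent, d))    # fixed random feature map

# inputs lie near a d_latent-dimensional manifold embedded in d dimensions
X = np.tanh(Z @ F / np.sqrt(d_latent))

# labels depend only on the latent coordinates, not on the ambient embedding
w = rng.standard_normal(d_latent)         # hypothetical linear teacher
y = np.sign(Z @ w)

print(X.shape, y.shape)
```

The point of the construction is that `X` is high-dimensional but its intrinsic structure (and the label rule) lives in `d_latent` dimensions, which is what makes the learning dynamics differ from unstructured i.i.d. data.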
Recurrent Neural Networks (RNNs) are powerful tools for learning computational tasks from data. How the tasks to be performed are encoded, and how the input/output data are represented, in a high-dimensional network is an important question. I will consider two related problems, both inspired by computational neuroscience: (1) how multiple low-dimensional maps (environments) can be embedded in an RNN, and (2) how multiplexed integrations of velocity signals can be carried out in the RNN to update positions in those maps. I will discuss the nature of the representations found in artificial RNNs, and compare them to experimental recordings in the mammalian brain.
2020
27 April
Jean-Christophe Mourrat
Mathematical physics seminar
I will describe a conjecture for the limit free energy of mean-field spin glasses with a bipartite structure, which I could prove to be an upper bound for the true limit. The conjectured limit is described in terms of the solution of an infinite-dimensional Hamilton-Jacobi equation. A fundamental difficulty of the problem is that the nonlinearity in this equation is not convex. I will also discuss the possibility of characterizing this conjectured limit in terms of a saddle-point problem.
We study the performance of stochastic gradient descent in high-dimensional inference tasks. Our focus is on the initial "search" phase, where the algorithm is far from a trust region and the loss landscape is highly non-convex. We develop a classification of the difficulty of this problem, namely whether the problem requires linearly, quasilinearly, or polynomially many samples in the dimension to achieve weak recovery of the parameter. This classification depends on an intrinsic property of the population loss which we call the "information exponent". We illustrate our approach by applying it to a wide variety of estimation tasks, such as parameter estimation for generalized linear models, two-component Gaussian mixture models, phase retrieval, and spiked matrix and tensor models, as well as supervised learning for single-layer networks with general activation functions. In this latter case, our results translate to the difficulty of this task for teacher-student networks in terms of the Hermite decomposition of the activation function.
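A minimal sketch of this setting, under simplifying assumptions: online SGD for a single-index model y = tanh(⟨w*, x⟩), with one fresh Gaussian sample per step, tracking the overlap with the ground-truth direction as a measure of weak recovery. The tanh link, the dimension, and the step size are illustrative choices, not the talk's exact setup.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 200                                    # ambient dimension

w_star = rng.standard_normal(d)            # ground-truth direction
w_star /= np.linalg.norm(w_star)

w = rng.standard_normal(d)                 # random start: overlap ~ 1/sqrt(d)
w /= np.linalg.norm(w)

lr = 0.5
for _ in range(20000):                     # online SGD: one fresh sample per step
    x = rng.standard_normal(d)
    y = np.tanh(x @ w_star)                # noiseless teacher label
    pred = np.tanh(x @ w)
    grad = (pred - y) * (1.0 - pred**2) * x   # gradient of the squared loss
    w -= (lr / d) * grad

overlap = abs(w @ w_star) / np.linalg.norm(w)
print(overlap)                             # weak recovery: bounded away from 0
```

Since tanh is odd, its first Hermite coefficient is nonzero (information exponent 1 in the talk's terminology), which is the easy regime where linearly many samples in the dimension suffice; higher information exponents would require correspondingly more samples.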
In a canonical supervised learning setting, we are given n data samples, each comprising a feature vector and a label, or response variable. We are asked to learn a function f that can predict the label associated with a new (unseen) feature vector. How is it possible that the model learnt from observed data generalizes to new points? Classical learning theory assumes that data points are drawn i.i.d. from a common distribution and argues that this phenomenon is a consequence of uniform convergence: the training error is close to its expectation uniformly over all models in a certain class. Modern deep learning systems appear to defy this viewpoint: they achieve training error that is significantly smaller than the test error, and yet generalize well to new data. I will present a sequence of high-dimensional examples in which this phenomenon can be understood in detail. [Based on joint work with Song Mei, Feng Ruan, Youngtak Sohn, Jun Yan]
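The tension between interpolation and test error can be illustrated with a toy random-features experiment; all sizes, the ReLU feature map, and the linear ground truth below are illustrative assumptions. With more features than training samples, the min-norm least-squares fit drives the training error to essentially zero while the test error remains of order one, exactly the gap the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, N = 100, 20, 400                 # train samples, input dim, random features

W = rng.standard_normal((d, N)) / np.sqrt(d)   # fixed random first layer

def features(X):
    return np.maximum(X @ W, 0.0)      # random ReLU features

beta = rng.standard_normal(d)          # hypothetical linear ground truth
X_tr = rng.standard_normal((n, d))
y_tr = X_tr @ beta + rng.standard_normal(n)        # noisy labels
X_te = rng.standard_normal((1000, d))
y_te = X_te @ beta + rng.standard_normal(1000)

# min-norm least squares interpolates the training data when N > n
theta = np.linalg.pinv(features(X_tr)) @ y_tr
train_err = np.mean((features(X_tr) @ theta - y_tr) ** 2)
test_err = np.mean((features(X_te) @ theta - y_te) ** 2)
print(train_err, test_err)             # training error ~ 0, test error O(1)
```

Uniform convergence cannot explain the model above: the gap between training and test error is large, yet the test error can still be reasonable, which is the regime the high-dimensional examples in the talk analyze precisely.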
The Hopfield model is considered in a teacher-student scenario as a problem of unsupervised learning with Restricted Boltzmann Machines (RBMs). For different choices of the priors for units and weights, the replica symmetric phase diagram of random RBMs is analyzed, and in particular the paramagnetic phase boundary is presented as directly related to the optimal size of the training set necessary for good generalization. The connection between the direct and inverse problem is pointed out by showing that inference can be efficiently performed by suitably adapting both standard learning techniques and standard approaches to the direct problem.
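A crude numerical sketch of a teacher-student RBM setup. This is illustrative only: the teacher's samples are drawn by fixing uniformly random hidden units and sampling visibles from their conditional (not exact equilibrium sampling), and the student is trained with one-step contrastive divergence (CD-1), a standard RBM learning technique and not necessarily the inference method of the talk.

```python
import numpy as np

rng = np.random.default_rng(4)
n_v, n_h, n_samples = 20, 5, 500       # visible units, hidden units, data size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# teacher: hidden units drawn uniformly, visibles from their conditional
W_teacher = rng.standard_normal((n_v, n_h)) / np.sqrt(n_v)
H = rng.integers(0, 2, (n_samples, n_h)).astype(float)
V = (rng.random((n_samples, n_v)) < sigmoid(H @ W_teacher.T)).astype(float)

# student RBM trained by one-step contrastive divergence (CD-1)
W = 0.01 * rng.standard_normal((n_v, n_h))
lr = 0.05
for _ in range(200):
    p_h = sigmoid(V @ W)                                  # positive phase
    h_s = (rng.random(p_h.shape) < p_h).astype(float)     # sample hiddens
    v_r = (rng.random(V.shape) < sigmoid(h_s @ W.T)).astype(float)
    p_h_neg = sigmoid(v_r @ W)                            # negative phase
    W += lr * (V.T @ p_h - v_r.T @ p_h_neg) / n_samples   # CD-1 update
print(W.shape)
```

The phase-diagram question in the abstract is, in this picture, how large `n_samples` must be, relative to the network size, before the student's weights carry usable information about the teacher's.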
We introduce a new version of the replica trick for disordered models, based on interpolation in the number of replicas rather than on analytic continuation. We show that this strategy allows one to find the order parameter in the general case, together with the associated variational principle. The method also works for multi-species models. We show some applications to neural networks and related systems.
2020
28 April
Optimal transport (OT) has recently gained a lot of interest in machine learning. It is a natural tool to compare probability distributions in a geometrically faithful way. It finds applications in both supervised learning (using geometric loss functions) and unsupervised learning (to perform generative model fitting). OT is, however, plagued by the curse of dimensionality, since it might require a number of samples which grows exponentially with the dimension. In this talk, I will review entropic regularization methods, which define geometric loss functions approximating OT with a better sample complexity. More information and references can be found on the website of our book "Computational Optimal Transport": https://optimaltransport.github.io/
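Entropic regularization leads to the classical Sinkhorn iterations, which compute an approximate transport plan by alternating diagonal scalings of a Gibbs kernel. A minimal sketch on a toy 1-D problem; the grid size, regularization strength eps, and iteration count are illustrative choices.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.5, n_iter=1000):
    """Entropic-regularized OT between histograms a, b with cost matrix C."""
    K = np.exp(-C / eps)                 # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # scaling step: match column marginals
        u = a / (K @ v)                  # scaling step: match row marginals
    return u[:, None] * K * v[None, :]   # approximate transport plan

# toy example: move mass between two histograms on a 1-D grid
x = np.linspace(0.0, 1.0, 5)
C = (x[:, None] - x[None, :]) ** 2       # squared-distance cost
a = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 0.5, 0.5])
P = sinkhorn(a, b, C)
print(P.sum())                           # total transported mass, ~1
```

Larger `eps` blurs the plan but speeds convergence and, as the talk discusses, improves sample complexity; `eps -> 0` recovers unregularized OT.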
2020
28 April
We prove that the binary classifiers of bit strings generated by random wide deep neural networks with ReLU activation function are biased towards simple functions. The simplicity is captured by the following two properties. For any given input bit string, the average Hamming distance of the closest input bit string with a different classification is at least sqrt(n / (2π log n)), where n is the length of the string. Moreover, if the bits of the initial string are flipped randomly, the average number of flips required to change the classification grows linearly with n. These results are confirmed by numerical experiments on deep neural networks with two hidden layers, and settle the conjecture stating that random deep neural networks are biased towards simple functions. This conjecture was proposed and numerically explored in [Valle Pérez et al., ICLR 2019] to explain the unreasonably good generalization properties of deep learning algorithms. The probability distribution of the functions generated by random deep neural networks is a good choice for the prior probability distribution in the PAC-Bayesian generalization bounds. Our results constitute a fundamental step forward in the characterization of this distribution, therefore contributing to the understanding of the generalization properties of deep learning algorithms. Advances in Neural Information Processing Systems 32, 1964-1976 (2019)
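The bit-flipping experiment can be sketched as follows (the widths, depth, and He-style initialization are illustrative assumptions, not the paper's exact setup): draw a random deep ReLU network with two hidden layers, then flip randomly chosen bits of an input string until its classification changes, recording how many flips were needed.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50                                      # input bit-string length

# random deep ReLU network with two hidden layers, He-style initialization
widths = [n, 200, 200, 1]
Ws = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
      for a, b in zip(widths[:-1], widths[1:])]

def classify(x):
    h = x.astype(float)
    for W in Ws[:-1]:
        h = np.maximum(h @ W, 0.0)          # ReLU hidden layers
    return int((h @ Ws[-1]).item() > 0)     # binary classification by sign

# flip randomly chosen bits until the classification changes
flips = []
for _ in range(100):
    x = rng.integers(0, 2, n)
    c0, k = classify(x), 0
    for i in rng.permutation(n):
        x[i] ^= 1                           # flip one bit
        k += 1
        if classify(x) != c0:
            break
    flips.append(k)
print(np.mean(flips))                       # average number of flips needed
```

The paper's claim is that this average grows linearly with n, i.e. random deep networks assign classifications that are stable under many random bit flips, which is the "bias towards simple functions" being quantified.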
I will report on recent work with Aukosh Jagannath and Reza Gheissari on the performance of gradient descent algorithms during the initial phase of the minimization of high dimensional loss functions, and try to understand how they manage to escape the region of maximal entropy.