[edit]

# DALI/ELLIS: Theory, Algorithms and Computations of Modern Learning Systems

[edit]

### Room: Conference room 7

9:30- 9:40

**Welcome**Francis Bach

9:40- 10:10

**Some thoughts and questions towards a statistical understanding of DNNs**Ingo Steinwart

10:10- 10:40

**Automated Data Summarization for Scalability in Bayes and Beyond**Tamara Broderick

**Coffee Break**

11:30- 12:15

**Plenary Discussion I**Mario Figueireido, Julien Mairal

12:15- 13:00

**Plenary Discussion II**Philipp Hennig, Thomas Schön

**Lunch Break**

16:00- 16:30

**Bandit Principal Component Analysis**Gergely Neu

16:30- 17:00

**Scaling up optimal kernel methods for large scale machine learning**Alessandro Rudi

17:00- 17:30

**Differentiable Ranking and Sorting using Optimal Transport**Marco Cuturi

**Coffee Break**

18:00- 18:30

**Implicit Regularization for Optimal Sparse Recovery**Patrick Rebeschini

18:30- 19:00

**Multi-agent learning**Vianney Perchet

19:00- 19:30

**Post-selection inference for nonlinear feature selection**Jean-Philippe Vert

19:30- 20:30

**ELLIS assembly**

21:00

**Banquet**

# Abstracts

**Some thoughts and questions towards a statistical understanding of DNNs**

#### Abstract

So far, our statistical understanding of deep neural networks (DNNs) is rather limited, in particular if over-parametrized DNNs are considered. Part of the reasons for this lack of understanding is the fact that for such large DNNs the tools of classical statistical learning theory can no longer be applied. Nonetheless, over-parametrized DNNs are typically performing well in practice. In this talk, I will discuss some thoughts and possible questions that may be relevant for a successful end-to-end analysis of DNNs

**Automated Data Summarization for Scalability in Bayes and Beyond**

#### Abstract

Many algorithms take prohibitively long to run on modern, large data sets. But even in complex data sets, many data points may be at least partially redundant for some task of interest. So one might instead construct and use a weighted subset of the data (called a "coreset") that is much smaller than the original dataset. Typically running algorithms on a much smaller data set will take much less computing time, but it remains to understand whether the output can be widely useful. (1) In particular, can running an analysis on a smaller coreset yield answers close to those from running on the full data set? (2) And can useful coresets be constructed automatically for new analyses, with minimal extra work from the user? We answer in the affirmative for a wide variety of problems in Bayesian inference. We demonstrate how to construct "Bayesian coresets" as an automatic, practical pre-processing step. We prove that our method provides geometric decay in relevant approximation error as a function of coreset size. Empirical analysis shows that our method reduces approximation error by orders of magnitude relative to uniform random subsampling of data. We discuss a number of directions and open problems beyond using a smaller effective number of data points in Bayesian inference -- including, but not limited to, summarizing data features (large p vs. large N), further challenges to automation, using other probabilistic methods, other useful reweightings of data, and theoretical understanding of fundamental limitations in learning.

**Bandit Principal Component Analysis**

#### Abstract

We consider a partial-feedback variant of the well-studied online PCA problem where a learner attempts to predict a sequence of d-dimensional vectors in terms of a quadratic loss, while only having limited feedback about the environment's choices. We focus on a natural notion of bandit feedback where the learner only observes the loss associated with its own prediction. Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of O(d^{3/2}T) after T rounds in the worst case. We also prove data-dependent bounds that improve on the basic result when the loss matrices of the environment have bounded rank or the loss of the best action is bounded. One version of our algorithm runs in O(d) time per trial which massively improves over every previously known online PCA method. We complement these results by a lower bound of Ω(dT^{1/2}).

**Scaling up optimal kernel methods for large scale machine learning**

#### Abstract

Kernel methods provide a principled way to perform non linear, nonparametric learning. They rely on solid functional analytic foundations and enjoy optimal statistical properties. However, at least in their basic form, they have limited applicability in large scale scenarios because of stringent computational requirements in terms of time and especially memory. In this talk we take a substantial step in scaling up kernel methods analyzing novel algorithms techniques that allow to efficiently process millions of points. The algorithms are derived combining several algorithmic principles, namely stochastic subsampling, iterative solvers and preconditioning. Our theoretical analysis shows that optimal statistical accuracy is achieved requiring essentially O(n) memory and O(n sqrt(n)) time. An extensive experimental analysis on large scale datasets shows that, even with a single machine, the analyzed approach outperforms previous state of the art solutions, which exploit parallel/distributed architectures.

**Differentiable Ranking and Sorting using Optimal Transport**

#### Abstract

Differentiable Ranking and Sorting using Optimal Transport: We consider in this work proxy operators for ranking and sorting that are differentiable. To do so, we leverage the +fact that sorting can be seen as a particular instance of the optimal transport (OT) problem between two univariate uniform probability measures, the first measure being supported on the family of input values of interest, the second supported on a family of increasing target values (e.g. 1,2,...,n if the input array has n elements). Building upon this link, we propose generalized rank and sort operators by considering arbitrary target measures supported on m values, where m can be smaller than n. We recover differentiable algorithms by adding to this generic OT problem an entropic regularization, and approximate its outputs using Sinkhorn iterations. We call these operators soft-ranks and soft-sorts. We show that they are competitive with the recently proposed neuralsort (grover, 2019). To illustrate their versatility, we use the soft-rank operator to propose a new classification training loss that is a differentiable proxy of the 0/1 loss. Using the soft-sort operator, we propose a new differentiable loss for trimmed regression.

**Implicit Regularization for Optimal Sparse Recovery**

#### Abstract

Implicit Regularization for Optimal Sparse Recovery: Ridge regression is a fundamental paradigm in machine learning and statistics, and it has long been known to be closely connected to the implicit regularization properties of gradient descent methods, cf. early stopping. Over the past decade, this connection has sparked research into a variety of directions aimed at developing computationally efficiency estimators, including acceleration, mini-batching, averaging, sketching, sub-sampling, preconditioning, and decentralization. Sparse recovery is another cornerstone of modern statistics and learning frameworks. Yet, perhaps surprisingly, here the connection to implicit regularization is not as well developed. Most results in the literature only involve limit statements (holding at convergence, for infinitesimal step sizes), apply to regimes with no (or limited) noise, and do not focus on computational efficiency. In this talk, we address the following question: Can we establish an implicit regularization theory for gradient descent to yield optimal sparse recovery in noisy settings, achieving minimax rates with the same cost of reading the data? We will highlight the key ideas to obtain some first results here, along with a few promising research directions. (Based on joint work with Tomas Vaškevičius and Varun Kanade)

**Multi-agent learning**

#### Abstract

Learning algorithms are not used by a single agent on its own; the decisions learned and/or the action taken will impact - and will be impacted by - other agents performing similar task at the same time, either competitively or cooperatively. In this talk, I will investigate several important use cases where such interactions arise and discuss the (unfortunately quite often) inappropriate models introduced and studied theoretically. Hopefully, we will be able to come out with new frameworks, assumptions, algorithms so that theory could be used in practice, in the near future.

**Post-selection inference for nonlinear feature selection**

#### Abstract

Post-selection inference for nonlinear feature selection: Model selection is an essential task for many applications in scientific discovery. The most common approaches rely on univariate linear measures of association between each feature and the outcome. Such classical selection procedures fail to take into account nonlinear effects and interactions between features. Kernel-based selection procedures have been proposed as a solution. However, current strategies for kernel selection fail to measure the significance of a joint model constructed through the combination of the basis kernels. In the present work, we exploit recent advances in post-selection inference to propose a valid statistical test for the association of a joint model of the selected kernels with the outcome. The kernels are selected via a step-wise procedure which we model as a succession of quadratic constraints in the outcome variable.