# Learning Theory

### Room: Sala Ginestra

Speakers:

- Francis Bach
- Gilles Blanchard
- Remi Gribonval
- Matthias Hein
- Alessandro Lazaric
- Haipeng Luo
- Andreas Maurer
- Shai Ben-David

09:30- 10:05

**How far are we from having a satisfactory theory of clustering?**Shai Ben-David, Universitys of Waterloo

10:05- 10:40

**Analysis of Nystrom Method with Sequential Ridge Leverage Score Sampling**Alessandro Lazaric, INRIA Lille - Nord Europe

10:40- 11:15

**Is adaptive early stopping possible in statistical inverse problems?**Gilles Blanchard, University of Potsdam

Coffee Break

11:45- 12:20

**Sharp Analysis of Random Feature Expansions**Francis Bach, Centre de Recherche INRIA de Paris

12:20- 13:00

**Efficient Second Order Online Learning via Sketching**Haipeng Luo, Princeton University

Lunch and Afternoon Activities

18:00- 18:40

**Spectral convergence of the graph Laplacian**Matthias Hein, Saarland University

18:40- 19:20

**Bounding the complexity of composite function classes**Andreas Maurer, Stemmer Imaging

19:20- 20:00

**Compressive Learning: Projections, Learning, and Sparsity for Efficient Data Processing**Rémi Gribonval, Centre de Recherche INRIA Rennes - Bretagne Atlantique, France

Dinner

# Abstracts

**How far are we from having a satisfactory theory of clustering?**

Shai Ben-David, Universitys of Waterloo
#### Abstract

Unsupervised learning, taking advantage of huge amounts of raw data available, is widely recognized as one of the most important challenges facing machine learning nowadays. For supervised tasks, machine learning theory has been successful in several respects; providing significant understanding of machine learning tasks (in terms of the resources and tools required to address them), insights about the pros and cons of alternative learning paradigms and their parameter settings, and initiating the development of new algorithmic approaches. However, no such successes had been so far achieved in the unsupervised ML domain. I will focus on clustering and discuss two aspects in which theory could be of great use. The first is model selection - how should a user pick an appropriate clustering tool for a given clustering problem and tune up the parameters it requires? The second aspect I will address is the computational complexity of clustering. Once a clustering model (or objective) has been picked, the task becomes an optimization problem. While most of the clustering objective optimization problems are computationally infeasible, they are being carried routinely in practice. I will describe some of the recent attempts to understand this discrepancy.

**Analysis of Nystrom Method with Sequential Ridge Leverage Score Sampling**

Alessandro Lazaric, INRIA Lille - Nord Europe
#### Abstract

Large-scale kernel ridge regression (KRR) is limited by the need to store a large kernel matrix K_t. To avoid storing the entire matrix K_t, Nystrom methods subsample a subset of columns of the kernel matrix, and efficiently find an approximate KRR solution on the reconstructed kernel matrix. The chosen subsampling distribution in turn affects the statistical and computational tradeoffs. For KRR problems (Rudi et al., 2015; Alaoui and M. Mahoney, 2015) show that a sampling distribution proportional to the mph{ridge leverage scores} (RLSs) provides strong reconstruction guarantees for the approximated kernel. While exact RLSs are as difficult to compute as a KRR solution, we may be able to approximate them well enough. In this paper, we study KRR problems in a sequential setting and introduce the INK-Estimate algorithm, that incrementally computes the RLSs estimates. INK Estimate maintains a small sketch of K_t, that at each step is used to compute an intermediate estimate of the RLSs. First, our sketch update does not require access to previously seen columns, and therefore a single pass over the kernel matrix is sufficient. Second, the algorithm requires a fixed, small space budget to run dependent only on the effective dimension of the kernel matrix. Finally, our sketch provides strong approximation guarantees on the reconstruction error, and on the statistical risk of the approximate KRR solution at any time, because all our guarantees hold at any intermediate step.

**Is adaptive early stopping possible in statistical inverse problems?**

Gilles Blanchard, University of Potsdam
#### Abstract

(Joint work with M. Hoffmann and M. Reiss) We consider a standard setting of statistical inverse problem, taking the form of the Gaussian sequence model with D observed noisy coefficients. Consider the simple family of ”keep or kill” estimators depending on a cutoff index k_0. For the choice of the cutoff index, there exist a number of well-known methods achieving oracle adaptivity (i.e. data-dependent choice of k_0 whose performance is comparable to the unknown optimal one), such as penalization and Lepski’s method. However, they have in common that the estimators for all values of k_0 have to be computed first and compared to each other in some way. Contrast this to an ”early stopping” approach where we would like to compute iteratively the estimators for k_0= 1, 2, . . . and have to decide to stop at some point without being allowed to compute the following estimators. Is oracle adaptivity possible then? This question is motivated by settings where computing estimators for larger k_0 requires more computational cost; furthermore some form of early stopping is most often used in practice. We propose a precise mathematical formulation of this question and provide upper and lower bounds on what is achievable.

** Sharp Analysis of Random Feature Expansions**

Francis Bach, Centre de Recherche INRIA de Paris
#### Abstract

Random feature expansions provide a simple way to avoid the usual quadratic running-time complexity of kernel methods. In this talk, I will present recent results about the approximation properties of these expansions. In particular, I will provide improved bounds on the number of features needed for a given approximation quality. I will also draw links with the problem of approximating integrals from finite sums, that is, the kernel quadrature problem.

**Efficient Second Order Online Learning via Sketching**

Haipeng Luo, Princeton University
#### Abstract

We propose Sketched Online Newton (SON), an online second order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data. SON is an enhanced version of the Online Newton Step, which, via sketching techniques enjoys a linear running time. We further improve the computational complexity to linear in the number of nonzero entries by creating sparse forms of the sketching methods (such as Oja's rule) for top eigenvector extraction. Together, these algorithms eliminate all computational obstacles in previous second order online learning approaches. This is joint work with Alekh Agarwal, Nicolo Cesa-Bianchi and John Langford.

**Bounding the complexity of composite function classes**

Andreas Maurer, Stemmer Imaging
#### Abstract

The current interest in deep learning motivates the search for methods to analyze the complexity of layered function classes. The talk presents a general bound on the Gaussian complexity of the composition of vector-valued function classes in terms of the complexities of the respective components. Applications exhibit the benefits of implicit feature learning in layered models of multi-task learning.

**Compressive Learning: Projections, Learning, and Sparsity for Efficient Data Processing**

Rémi Gribonval, Centre de Recherche INRIA Rennes - Bretagne Atlantique, France
#### Abstract

The talk will discuss generalizations of sparse recovery guarantees and compressive sensing to the context of machine learning. Assuming some low-dimensional model on the probability distribution of the data, we will see that in certain scenarios it is indeed possible to (randomly) compress a large data- collection into a reduced representation, of size driven by the complexity of the learning task, while preserving the essential information necessary to process it. Two case studies will be given: compressive clustering, and compressive Gaussian Mixture Model estimation, with an illustration on large-scale model-based speaker verification.