Recent advances in Machine Learning (ML) have revolutionized our ability to solve complex problems in a myriad of application domains. Yet, just as empirical data plays a fundamental role in the development of such applications, the process of designing these methods has also remained empirical: we have learned which of the known methods tend to perform better for certain types of problems, and have developed intuition guiding our discovery of new methods.
In contrast, classical algorithmic theory provides tools directly addressing the mathematical core of a problem, and clear theoretical justifications motivate powerful design techniques. At the heart of this process is the analysis of the correctness and time/space efficiency of an algorithm, providing actionable bounds and guarantees. Problems themselves may be characterized by bounding the performance of any algorithm, providing a meaningful reference point to which concrete algorithms may be compared. While ML models may appear to be an awkward fit for such techniques, some research in the area has succeeded in obtaining results with the “definitive” flavour associated with algorithms, complementary to empirical ones. Are such discoveries bound to be exceptions, or can they be part of a new algorithmic theory?
The GroundedML workshop seeks to bring together researchers from both the algorithmic theory and machine learning communities, starting a dialogue on how ideas from theoretical algorithm design can inspire and guide future research in machine learning.
Fri 5:45 a.m. - 6:00 a.m. | Opening Remarks (Remarks) | Perouz Taslakian · Pierre-André Noël · David Vazquez · Jian Tang · Xavier Bresson
Fri 6:00 a.m. - 6:45 a.m. | Learning Algorithmic Tasks with Graph Neural Networks: Generalization and Extrapolation (Invited Talk) | Stefanie Jegelka

Graph Neural Networks (GNNs) have become a popular tool for learning algorithmic tasks, in particular those related to combinatorial optimization. In this talk, we will focus on the “algorithmic reasoning” task of learning a full algorithm. While GNNs have shown promising empirical results, their generalization properties are less well understood. In particular, the results are affected by the representational power of the network, by task and data properties, and by the inductive biases arising from the architectural structure and the learning algorithm; the talk summarizes the effects of each of these. For instance, empirically, we observe an interplay between the structure of the task (or target algorithm) and the inductive biases of the architecture: although many networks may be able to represent a task, some architectures learn it better than others. We formalize this relationship via “algorithmic alignment”, and study its empirical and theoretical implications for generalization within and out of the training distribution. Our results suggest different alignment conditions for within-distribution generalization and for extrapolation, where, for extrapolation with ReLU GNNs, nonlinearities play an important role. This talk is based on joint work with Keyulu Xu, Jingling Li, Mozhi Zhang, Simon S. Du, Ken-ichi Kawarabayashi, Vikas Garg and Tommi Jaakkola.
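For readers less familiar with the message-passing view referenced in this talk, a generic GNN layer can be written as follows (notation ours, not necessarily the speaker's):

$$h_v^{(t+1)} = \phi\Big(h_v^{(t)},\ \bigoplus_{u \in \mathcal{N}(v)} \psi\big(h_u^{(t)}, h_v^{(t)}, e_{uv}\big)\Big),$$

where $\bigoplus$ is a permutation-invariant aggregator and $\phi$, $\psi$ are learned functions. The algorithmic-alignment intuition is that architectures whose components mirror the target algorithm's update (for instance, choosing $\bigoplus = \min$ for shortest-path-style relaxations) tend to learn the task with better sample efficiency and to extrapolate more reliably.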
Fri 6:45 a.m. - 7:30 a.m. | Enabling Empirically and Theoretically Sound Algorithmic Alignment (Invited Talk) | Petar Veličković

Neural networks that are able to reliably execute algorithmic computation may hold transformative potential for both machine learning and theoretical computer science. On the one hand, they could enable the kind of extrapolative generalisation scarcely seen with deep learning models; on the other, they may allow classical algorithms to be run on inputs previously considered inaccessible to them. Both of these promises are shepherded by the neural algorithmic reasoning blueprint, which I recently proposed in a position paper alongside Charles Blundell. On paper, this is a remarkably elegant pipeline for reasoning on natural inputs which carefully leverages the tried-and-tested power of deep neural networks as feature extractors. In practice, however, its success rests on the successful realisation of algorithmic alignment: neural networks that are capable of executing algorithmic computations in a high-dimensional latent space, ideally in a manner that extrapolates outside of the training set. In this talk, I will outline some of our recent work on strengthening algorithmic alignment from both an empirical and a theoretical point of view. These efforts include a dataset of algorithmic reasoning tasks, to be used as a bootstrapping basis, as well as a formalisation of the theoretical connection between (graph) neural networks and dynamic programming using the tools of category theory and abstract algebra.
Fri 7:30 a.m. - 7:40 a.m. | Coffee Break
Fri 7:40 a.m. - 8:25 a.m. | Local Signal Adaptivity: Feature Learning in Neural Networks Beyond Kernels (Invited Talk) | Aarti Singh

Neural networks have been shown to significantly outperform kernel methods (including neural tangent kernels) in problems such as image classification. Most theoretical explanations of this performance gap rely on a complex or stylized hypothesis class, which leads to a disconnect between theory and practice. In this talk, I will demonstrate a simple hypothesis class, inspired by natural images, that explains the gap in terms of finding a sparse signal in the presence of noise: neural networks have an improved ability to denoise, or more generally to discard irrelevant features, compared to kernels. Specifically, we show that, for a simple data distribution with a sparse signal amidst high-variance noise, a convolutional neural network trained using stochastic gradient descent learns to threshold out the noise and find the signal. In contrast, the corresponding neural tangent kernel, with a fixed set of predetermined features, is unable to adapt to the signal in this manner. This is joint work with Stefani Karp, Ezra Winston and Yuanzhi Li.
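The kind of data distribution described in this abstract can be caricatured as follows (our sketch, with parameters of our own choosing, not necessarily the exact model from the talk): an input consists of $P$ patches $x = (x_1, \dots, x_P)$, one of which carries a low-magnitude signal $x_{j^*} = y\,\mu$ determined by the label $y \in \{\pm 1\}$, while the remaining patches are pure noise $x_j \sim \mathcal{N}(0, \sigma^2 I)$ with $\sigma \gg \lVert \mu \rVert$. A convolutional network with a shared ReLU filter and a bias acting as a threshold can zero out the high-variance noise patches and retain the signal patch, whereas a fixed kernel averages over all patches and lets the noise swamp the signal; this is the "local signal adaptivity" the title refers to.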
Fri 8:25 a.m. - 8:35 a.m. | Graph Attention Retrospective (Contributed Talk) | Kimon Fountoulakis · Amit Levi · Shenghao Yang · Aseem Baranwal · Aukosh Jagannath

Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular types of model is the graph attention network. These models were introduced to allow a node to aggregate information from the features of its neighbors in a non-uniform way, in contrast to simple graph convolution, which does not distinguish the neighbors of a node. In this paper, we study this expected behaviour of graph attention networks theoretically. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification under a contextual stochastic block model. Here, the features of the nodes are obtained from a mixture of Gaussians and the edges from a stochastic block model, with the features and the edges coupled in a natural way. First, we show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention maintains the weights of intra-class edges and significantly reduces the weights of inter-class edges. As a corollary, this implies perfect node classification independent of the weights of inter-class edges. However, a classical argument shows that in the "easy" regime the graph is not needed at all to classify the data with high probability. In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. We evaluate our theoretical results on synthetic and real-world data.
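For readers unfamiliar with the contextual stochastic block model (CSBM) used in this work, a minimal sampler might look as follows (a sketch with parameter names of our choosing, not the authors' code):

import numpy as np

def sample_csbm(n, mu, sigma, p_intra, p_inter, seed=0):
    """Toy two-class CSBM: labels are +/-1, features come from a Gaussian mixture
    N(y * mu, sigma^2 I), and edges are Bernoulli with probability p_intra within a
    class and p_inter across classes."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)                                   # class labels
    X = y[:, None] * mu[None, :] + sigma * rng.standard_normal((n, len(mu)))
    probs = np.where(y[:, None] == y[None, :], p_intra, p_inter)      # edge probabilities
    A = (rng.random((n, n)) < probs).astype(int)
    A = np.triu(A, 1)
    A = A + A.T                                                       # undirected, no self-loops
    return X, A, y

X, A, y = sample_csbm(n=200, mu=np.ones(5), sigma=1.0, p_intra=0.5, p_inter=0.1)

The "easy" versus "hard" regimes studied in the paper are then governed by how far apart the two feature means are relative to the noise level.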
Fri 8:35 a.m. - 8:45 a.m. | The CLRS Algorithmic Reasoning Benchmark (Contributed Talk) | Petar Veličković · Adria Puigdomenech Badia · David Budden · Razvan Pascanu · Andrea Banino · Misha Dashevskiy · Raia Hadsell · Charles Blundell

Learning representations of algorithms is an emerging area of machine learning, seeking to bridge concepts from neural networks with classical algorithms. Several important works have investigated whether neural networks can effectively reason like algorithms, typically by learning to execute them. The common trend in the area, however, is to generate targeted kinds of algorithmic data to evaluate specific hypotheses, making results hard to transfer across publications and raising the barrier to entry. To consolidate progress and work towards unified evaluation, we propose the CLRS Algorithmic Reasoning Benchmark, covering classical algorithms from the Introduction to Algorithms textbook. Our benchmark spans a variety of algorithmic reasoning procedures, including sorting, searching, dynamic programming, graph algorithms, string algorithms and geometric algorithms. We perform extensive experiments to demonstrate how several popular algorithmic reasoning baselines perform on these tasks and, consequently, highlight links to several open challenges.
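As a rough illustration of the "learning to execute" supervision that benchmarks of this kind provide, the sketch below (ours, not the benchmark's API) records the intermediate states of insertion sort, i.e. the kind of step-by-step targets a network would be trained to predict alongside the final output:

def insertion_sort_trace(a):
    """Return the list of intermediate array states produced by insertion sort,
    one snapshot per outer-loop iteration. Purely illustrative."""
    a = list(a)
    trace = [tuple(a)]
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
        trace.append(tuple(a))
    return trace

print(insertion_sort_trace([3, 1, 2]))  # [(3, 1, 2), (1, 3, 2), (1, 2, 3)]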
Fri 8:45 a.m. - 9:15 a.m. | Poster Session (GatherTown)
Fri 9:15 a.m. - 9:50 a.m. | Lunch Break
Fri 9:50 a.m. - 10:20 a.m. | Panel/Q&A for Morning Invited Talks (Panel/Q&A) | Stefanie Jegelka · Petar Veličković · Aarti Singh · Perouz Taslakian · Pierre-André Noël · David Vazquez · Jian Tang · Xavier Bresson
Fri 10:20 a.m. - 11:05 a.m. | Deep learning theory vs traditional theory of algorithms (Invited Talk, live) | Sanjeev Arora

Having spent a significant part of my career in traditional algorithm design and complexity analysis before switching to the theory of machine learning (especially deep learning), I will survey differences between the two fields. An important one is that machine learning often concerns a "ghost" objective (test loss, or performance on new downstream tasks) that is unavailable during training. "Goodness" of the solution is undefined a priori, and theory necessarily has to create the notion of goodness as part of the analysis. It is increasingly clear that the key missing piece here is a better understanding of the trajectory of model parameters during training. For instance, classic notions of "size", "capacity" and "representation power", or even the "goal" of training, turn out to have no independent meaning without understanding the trajectory. This will be illustrated using several interesting theoretical developments as well as experiments from recent years, leading to an agenda for future work.
Fri 11:05 a.m. - 11:50 a.m. | Optimization Algorithms in the Large: Exact Dynamics, Average-case Analysis, and Stepsize Criticality (Invited Talk) | Courtney Paquette

In this talk, I will present a framework, inspired by random matrix theory, for analyzing the dynamics of optimization algorithms (e.g., first-order methods, stochastic gradient descent (SGD), and momentum) when both the number of samples and the number of dimensions are large. Using this new framework, we show that the dynamics of optimization algorithms on a least squares problem with random data become deterministic in the large sample and dimensional limit. In particular, the limiting dynamics of stochastic algorithms are governed by a Volterra integral equation. This model predicts that SGD undergoes a phase transition at an explicitly given critical stepsize that ultimately affects its convergence rate, which we also verify experimentally. Finally, when the input data is isotropic, we provide explicit expressions for the dynamics and average-case convergence rates. These rates show significant improvement over the worst-case complexities.
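A minimal experiment in the spirit of this talk (our sketch, not the speaker's code) runs single-sample SGD on a random least-squares problem in which the sample size and the dimension are both large, so that the effect of the stepsize on the convergence rate can be observed directly:

import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 1000                              # many samples and many dimensions
A = rng.standard_normal((n, d)) / np.sqrt(d)   # random data matrix
x_star = rng.standard_normal(d)
b = A @ x_star + 0.1 * rng.standard_normal(n)  # noisy targets

def loss(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

x = np.zeros(d)
stepsize = 0.5                                 # vary this to probe stepsize criticality
for t in range(20001):
    i = rng.integers(n)                        # single-sample stochastic gradient
    x -= stepsize * (A[i] @ x - b[i]) * A[i]
    if t % 5000 == 0:
        print(t, loss(x))

Averaging such runs over many random draws of A is the "average-case" point of view; the deterministic limiting curve predicted by the Volterra equation is what these empirical loss curves concentrate around as n and d grow.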
Fri 11:50 a.m. - 12:00 p.m. | Coffee Break
Fri 12:00 p.m. - 12:45 p.m. | Continuous cutting plane algorithms in integer programming (Invited Talk) | Andrea Lodi

Cutting planes for mixed-integer linear programs (MILPs) are typically computed in rounds by iteratively solving optimization problems. Instead, we reframe the problem of finding good cutting planes as a continuous optimization problem over weights parametrizing families of valid inequalities. We propose a practical two-step algorithm, and demonstrate empirical gains when optimizing Gomory mixed-integer inequalities over various families of MILPs. Finally, we provide a reinterpretation of our approach as optimization of the subadditive dual using a deep neural network.
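For orientation, the Gomory mixed-integer (GMI) cut mentioned above is, in its standard textbook form, read off a simplex-tableau row $x_B = \bar b - \sum_{j \in N} \bar a_j x_j$ with fractional parts $f_0 = \bar b - \lfloor \bar b \rfloor$ and $f_j = \bar a_j - \lfloor \bar a_j \rfloor$:

$$\sum_{j \in N_I:\, f_j \le f_0} \frac{f_j}{f_0}\, x_j \;+\; \sum_{j \in N_I:\, f_j > f_0} \frac{1 - f_j}{1 - f_0}\, x_j \;+\; \sum_{j \in N_C:\, \bar a_j \ge 0} \frac{\bar a_j}{f_0}\, x_j \;+\; \sum_{j \in N_C:\, \bar a_j < 0} \frac{-\bar a_j}{1 - f_0}\, x_j \;\ge\; 1,$$

where $N_I$ and $N_C$ index the nonbasic integer and continuous variables. The approach in this talk replaces the round-by-round generation of such inequalities with a continuous optimization over weights that parametrize families of valid inequalities of this kind.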
Fri 12:45 p.m. - 1:30 p.m. | Distribution-dependent generalization bounds for noisy, iterative learning algorithms (Invited Talk) | Gintare Dziugaite

Deep learning approaches dominate in many application areas. Still, our understanding of generalization in deep learning is incomplete. Despite progress in uncovering and understanding the phenomena underlying strong generalization, significant gaps remain. In this talk, I will discuss barriers to understanding generalization using traditional tools from statistical learning theory and explain why any explanation of generalization in deep learning must be data-dependent. I will discuss the role of empirical evaluation of theories of deep learning and propose a framework for how we ought to evaluate such theories empirically. I will then discuss my work on information-theoretic approaches to understanding the generalization of noisy, iterative learning algorithms, such as Stochastic Gradient Langevin Dynamics, a noisy version of SGD.
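For concreteness, the Stochastic Gradient Langevin Dynamics update referred to above takes the form (a standard formulation; scaling conventions vary):

$$\theta_{t+1} = \theta_t - \eta_t\, \widehat{\nabla} L(\theta_t) + \sqrt{2\eta_t/\beta}\;\varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, I),$$

where $\widehat{\nabla} L$ is a minibatch gradient estimate, $\eta_t$ the stepsize and $\beta$ an inverse temperature. The explicitly injected Gaussian noise is what makes information-theoretic quantities tractable for this family of algorithms.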
Fri 1:30 p.m. - 1:40 p.m. | Coffee Break
Fri 1:40 p.m. - 2:10 p.m. | Panel/Q&A for Afternoon Invited Talks (Panel/Q&A) | Sanjeev Arora · Courtney Paquette · Gintare Dziugaite · Andrea Lodi · Perouz Taslakian · Pierre-André Noël · David Vazquez · Jian Tang · Xavier Bresson
Fri 2:10 p.m. - 2:30 p.m. | Prizes and Closing Remarks (Remarks) | Perouz Taslakian · Pierre-André Noël · David Vazquez · Jian Tang · Xavier Bresson
Poster (GatherTown) | Dual Conservative Policy Update for Efficient Model-Based Reinforcement Learning | Shenao Zhang

Model-based reinforcement learning (MBRL) algorithms achieve data efficiency by acquiring predictive models. Unlike greedy model-exploitation algorithms, provable MBRL algorithms based on optimism or posterior sampling are guaranteed to achieve optimal performance asymptotically, with guarantees stated in terms of additional model complexity measures. However, this complexity is not polynomial for many model function classes, which makes reaching the global optimum challenging in practice. Due to aggressive policy updates and over-exploration, convergence can be very slow and the policy may even end up suboptimal. Thus, in addition to the asymptotic guarantee, ensuring iterative policy improvement is key to achieving high performance within finite timesteps. To this end, we propose Dual Conservative Policy Update (DCPU) for MBRL, which combines a locally greedy update procedure with a conservative exploration update procedure. By greedily exploiting the local model and maximizing the expected value within a trust region, DCPU agents explore efficiently. We provide an iterative policy improvement bound for DCPU and show a monotonic improvement property under a Lipschitz assumption and a suitable constraint threshold. In addition, we prove the asymptotic optimality of DCPU with a sublinear Bayes regret bound. Empirical results demonstrate the superiority of DCPU on several MuJoCo tasks.
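The "conservative" ingredient above is in the spirit of classical trust-region policy updates, whose generic form (shown here for orientation; not necessarily DCPU's exact objective) is

$$\pi_{k+1} = \arg\max_{\pi}\; \mathbb{E}_{s \sim \rho_{\pi_k},\, a \sim \pi}\big[\widehat{A}_{\pi_k}(s, a)\big] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \rho_{\pi_k}}\big[D_{\mathrm{KL}}\big(\pi_k(\cdot \mid s)\,\|\,\pi(\cdot \mid s)\big)\big] \le \delta,$$

where $\widehat{A}_{\pi_k}$ is an advantage estimate under the current policy (here obtainable from the learned model) and $\delta$ is the trust-region radius that keeps each update conservative.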
Poster (GatherTown) | K-level SLOPE: Simplified and Adaptive Variable Selection for Optimization of Estimation Risk | Zhiqi Bu · Rachel Wu

Among sparse linear models, SLOPE generalizes the LASSO via an adaptive $l_1$ regularization that applies heavier penalties to larger entries of the estimator. To achieve such adaptivity in an $n\times p$ problem, SLOPE requires a penalty sequence in $\mathbb{R}^p$, in contrast to the single penalty scalar of the LASSO. Tuning this $\mathbb{R}^p$ SLOPE penalty in high dimension poses a challenge, as a brute-force search for the optimal penalty is computationally infeasible. In this work, we formally propose $K$-level SLOPE as a convex optimization problem; it is an important sub-class of SLOPE (which we term $p$-level SLOPE) and has only $(2K-1)\ll p$ hyperparameters. We further develop a projected gradient descent method to search for the optimal $K$-level SLOPE penalty under a Gaussian random data matrix. Interestingly, our experiments demonstrate that even the simplest 2-level SLOPE can give substantial improvements over the LASSO and be comparable to $p$-level SLOPE, suggesting its usefulness for practitioners.
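For reference, the SLOPE estimator discussed in this poster solves

$$\hat\beta = \arg\min_{b \in \mathbb{R}^p}\; \tfrac{1}{2}\lVert y - X b\rVert_2^2 + \sum_{i=1}^{p} \lambda_i\, |b|_{(i)}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0,$$

where $|b|_{(1)} \ge \dots \ge |b|_{(p)}$ are the sorted absolute entries of $b$; the LASSO is recovered when all $\lambda_i$ are equal. As we read the abstract, $K$-level SLOPE constrains $\lambda$ to take only $K$ distinct values, so that the $K$ levels together with the $K-1$ breakpoints give the stated $2K-1$ hyperparameters.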
Poster (GatherTown) | Graph Attention Retrospective | Kimon Fountoulakis · Amit Levi · Shenghao Yang · Aseem Baranwal · Aukosh Jagannath

Abstract as for the contributed talk of the same title above.
Poster (GatherTown) | Learning heuristics for A* | Danilo Numeroso · Davide Bacciu · Petar Veličković

Path finding in graphs is one of the most studied classes of problems in computer science. In this context, search algorithms are often extended with heuristics for a more efficient search of target nodes. In this work, we build on recent advances in neural algorithmic reasoning to learn efficient heuristic functions for path finding problems on graphs. At training time, we exploit multi-task learning to jointly learn Dijkstra's algorithm and a consistent heuristic function for the A* search algorithm. At inference time, we plug our learnt heuristics into the A* algorithm. Results show that running A* with the learnt heuristic values can greatly speed up the search for target nodes compared to Dijkstra's algorithm, while still finding minimal-cost paths.
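To make the inference-time setup concrete, the following is a minimal A* implementation into which a learnt heuristic can be plugged as a plain callable (an illustrative sketch of ours, not the authors' code):

import heapq

def a_star(adj, source, target, heuristic):
    """A* over a weighted digraph given as adj[u] = [(v, weight), ...].
    `heuristic(v)` estimates the remaining cost from v to the target; a trained
    model can be wrapped as this callable. With a consistent heuristic the
    returned cost is the minimal path cost."""
    dist = {source: 0.0}
    frontier = [(heuristic(source), source)]
    while frontier:
        _, u = heapq.heappop(frontier)
        if u == target:
            return dist[u]
        for v, w in adj.get(u, []):
            new_cost = dist[u] + w
            if new_cost < dist.get(v, float("inf")):
                dist[v] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(v), v))
    return float("inf")

# With heuristic = lambda v: 0.0 this reduces exactly to Dijkstra's algorithm,
# which is the baseline the learnt heuristic is compared against.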
Poster (GatherTown) | Meta Mirror Descent: Optimiser Learning for Fast Convergence | Boyan Gao · Henry Gouk · Hae Beom Lee · Timothy Hospedales

Optimisers are an essential component for training machine learning models, and their design influences learning speed and generalisation. Several studies have attempted to learn more effective gradient-descent optimisers by solving a bi-level optimisation problem in which generalisation error is minimised with respect to optimiser parameters. However, most existing optimiser learning methods are intuitively motivated, without clear theoretical support. We take a different perspective, starting from mirror descent rather than gradient descent and meta-learning the corresponding Bregman divergence. Within this paradigm, we formalise a novel meta-learning objective of minimising the regret bound of learning. The resulting framework, termed Meta Mirror Descent (MetaMD), learns to accelerate optimisation speed. Unlike many meta-learned optimisers, it also supports convergence and generalisation guarantees, and uniquely does so without requiring validation data. We evaluate our framework on a variety of tasks and architectures in terms of convergence rate and generalisation error, and demonstrate strong performance.
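For reference, the mirror descent step that MetaMD builds on can be written as

$$x_{t+1} = \arg\min_{x}\; \eta\,\langle g_t, x\rangle + D_{\varphi}(x, x_t), \qquad D_{\varphi}(x, y) = \varphi(x) - \varphi(y) - \langle \nabla\varphi(y),\, x - y\rangle,$$

where $g_t$ is a (sub)gradient, $\eta$ a stepsize and $D_{\varphi}$ the Bregman divergence of a strictly convex potential $\varphi$; plain gradient descent is the special case $\varphi(x) = \tfrac12\lVert x\rVert_2^2$. In this framework, learning the optimiser amounts to meta-learning $\varphi$ (equivalently $D_{\varphi}$), with the meta-objective being a regret bound rather than a validation loss, as described in the abstract.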
Poster (GatherTown) | Graph Neural Networks are Dynamic Programmers | Andrew Dudzik · Petar Veličković

Recent advances in neural algorithmic reasoning with graph neural networks (GNNs) are propped up by the notion of algorithmic alignment. Broadly, a neural network will be better at learning to execute a reasoning task (in terms of sample complexity) if its individual components align well with the target algorithm. Specifically, GNNs are claimed to align with dynamic programming (DP), a general problem-solving strategy which expresses many polynomial-time algorithms. However, has this alignment truly been demonstrated and theoretically quantified? Here we show, using methods from category theory and abstract algebra, that there exists an intricate connection between GNNs and DP, going well beyond the initial observations over individual algorithms such as Bellman-Ford. Exposing this connection, we easily verify several prior findings in the literature, and hope it will serve as a foundation for building stronger algorithmically aligned GNNs.
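The Bellman-Ford example mentioned above is easy to make concrete: each round of the algorithm has exactly the shape of a message-passing layer with min-aggregation. The sketch below (ours, for illustration) makes that correspondence explicit:

def bellman_ford_as_message_passing(edges, n, source):
    """Single-source shortest paths written as synchronous rounds of message
    passing: every node min-aggregates messages d[u] + w from its in-neighbours,
    the same computational shape as a GNN layer with min aggregation."""
    INF = float("inf")
    d = [INF] * n
    d[source] = 0.0
    for _ in range(n - 1):                                   # n-1 synchronous rounds
        messages = {v: [] for v in range(n)}
        for u, v, w in edges:                                # directed edge u -> v with weight w
            if d[u] < INF:
                messages[v].append(d[u] + w)                 # "message" sent along the edge
        d = [min([d[v]] + messages[v]) for v in range(n)]    # aggregate and update
    return d

print(bellman_ford_as_message_passing([(0, 1, 2.0), (1, 2, 1.0), (0, 2, 5.0)], n=3, source=0))
# [0.0, 2.0, 3.0]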
Poster (GatherTown) | The CLRS Algorithmic Reasoning Benchmark | Petar Veličković · Adria Puigdomenech Badia · David Budden · Razvan Pascanu · Andrea Banino · Misha Dashevskiy · Raia Hadsell · Charles Blundell

Abstract as for the contributed talk of the same title above.
Poster (GatherTown) | Continuous Neural Algorithmic Planners | Yu He · Petar Veličković · Pietro Lio · Andreea Deac

Planning is an important aspect of successful agency, especially as tasks become more combinatorially challenging. This intuition is applied in reinforcement learning through algorithms, such as value iteration, that allow us to plan and obtain optimal policies when given the necessary information about the environment. Implicit planning eliminates the need for this privileged information by combining learned world models with model-free reinforcement learning. A recent implicit planner, XLVIN, reaps the benefits of modern representation learning while still maintaining alignment to the value iteration algorithm; however, it only supports discrete action spaces, and hence cannot be applied directly to most tasks of real-world interest. We extend XLVIN to continuous action spaces by discretising the action space and evaluating several selective expansion policies. Our proposal, CNAP, demonstrates how neural algorithmic reasoning can make a measurable impact in higher-dimensional continuous control settings, such as MuJoCo, bringing gains in low-data settings and outperforming model-free baselines.
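For reference, the value iteration update that XLVIN, and hence CNAP, aligns with is

$$V_{k+1}(s) = \max_{a}\Big[r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s')\Big],$$

i.e. the planning step the implicit planner mimics in a latent space. The continuous-action extension in CNAP replaces the exact maximisation over actions with a search over a discretised action set whose expansion is controlled by the selective expansion policies evaluated in the paper.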