About me

I’m a PhD researcher in the Sequential Decision Making group at the Delft University of Technology, supervised by Matthijs Spaan and Wendelin Böhmer. I do research in reinforcement learning with a focus on developing RL agents that can generalise to new scenarios. Currently, I am investigating the role of exploration in improving generalisation performance, as well as the zero-shot generalisation capabilities of offline RL agents.

Generally, I am interested in many things. At the moment this includes generalisation, adaptation, continual learning, causality, physics, the scientific method, software engineering, playing guitar, singing, painting and collecting fossils.

News

Sep 18, 2025	Our paper “How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning” got accepted at NeurIPS 2025!
Jul 10, 2025	Presented our work “Exploration Implies Data Augmentation” in Cathy Wu’s lab at MIT

Publications

- How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning
  
  Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Conference on Neural Information Processing Systems (NeurIPS 2025), May 2025
  
  In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
- Universal Value-Function Uncertainties
  
  Moritz A. Zanger, Max Weltevrede, Yaniv Oren, Pascal R. Van der Vaart, Caroline Horsch, Wendelin Böhmer, and Matthijs T. J. Spaan
  
  Preprint. Under Review, May 2025
  
  Estimating epistemic uncertainty in value functions is a crucial challenge for many aspects of reinforcement learning (RL), including efficient exploration, safe decision-making, and offline RL. While deep ensembles provide a robust method for quantifying value uncertainty, they come with significant computational overhead. Single-model methods, while computationally favorable, often rely on heuristics and typically require additional propagation mechanisms for myopic uncertainty estimates. In this work we introduce universal value-function uncertainties (UVU), which, similar in spirit to random network distillation (RND), quantify uncertainty as squared prediction errors between an online learner and a fixed, randomly initialized target network. Unlike RND, UVU errors reflect policy-conditional value uncertainty, incorporating the future uncertainties any given policy may encounter. This is due to the training procedure employed in UVU: the online network is trained using temporal difference learning with a synthetic reward derived from the fixed, randomly initialized target network. We provide an extensive theoretical analysis of our approach using neural tangent kernel (NTK) theory and show that in the limit of infinite network width, UVU errors are exactly equivalent to the variance of an ensemble of independent universal value functions. Empirically, we show that UVU achieves equal performance to large ensembles on challenging multi-task offline RL settings, while offering simplicity and substantial computational savings.
- Exploration Implies Data Augmentation: Reachability and Generalisation in Contextual MDPs
  
  Max Weltevrede, Caroline Horsch, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Preprint. Under Review, Feb 2025
  
  In the zero-shot policy transfer (ZSPT) setting for contextual Markov decision processes (MDP), agents train on a fixed set of contexts and must generalise to new ones. Recent work has argued and demonstrated that increased exploration can improve this generalisation, by training on more states in the training contexts. In this paper, we demonstrate that training on more states can indeed improve generalisation, but can come at a cost of reducing the accuracy of the learned value function which should not benefit generalisation. We introduce reachability in the ZSPT setting to define which states/contexts require generalisation and explain why exploration can improve it. We hypothesise and demonstrate that using exploration to increase the agent’s coverage while also increasing the accuracy improves generalisation even more. Inspired by this, we propose a method Explore-Go that implements an exploration phase at the beginning of each episode, which can be combined with existing on- and off-policy RL algorithms and significantly improves generalisation even in partially observable MDPs. We demonstrate the effectiveness of Explore-Go when combined with several popular algorithms and show an increase in generalisation performance across several environments. With this, we hope to provide practitioners with a simple modification that can improve the generalisation of their agents.
- Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning
  
  Max Weltevrede, Felix Kaubek, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Seventeenth European Workshop on Reinforcement Learning (EWRL), Sep 2024
  
  One of the remaining challenges in reinforcement learning is to develop agents that can generalise to novel scenarios they might encounter once deployed. This challenge is often framed in a multi-task setting where agents train on a fixed set of tasks and have to generalise to new tasks. Recent work has shown that in this setting increased exploration during training can be leveraged to increase the generalisation performance of the agent. This makes sense when the states encountered during testing can actually be explored during training. In this paper, we provide intuition why exploration can also benefit generalisation to states that cannot be explicitly encountered during training. Additionally, we propose a novel method Explore-Go that exploits this intuition by increasing the number of states on which the agent trains. Explore-Go effectively increases the starting state distribution of the agent and as a result can be used in conjunction with most existing on-policy or off-policy reinforcement learning algorithms. We show empirically that our method can increase generalisation performance in an illustrative environment and on the Procgen benchmark.
- The Role of Diverse Replay for Generalisation in Reinforcement Learning
  
  Max Weltevrede, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Sixteenth European Workshop on Reinforcement Learning (EWRL), Aug 2023
  
  In reinforcement learning (RL), key components of many algorithms are the exploration strategy and replay buffer. These strategies regulate what environment data is collected and trained on and have been extensively studied in the RL literature. In this paper, we investigate the impact of these components in the context of generalisation in multi-task RL. We investigate the hypothesis that collecting and training on more diverse data from the training environments will improve zero-shot generalisation to new tasks. We motivate mathematically and show empirically that generalisation to tasks that are “reachable” during training is improved by increasing the diversity of transitions in the replay buffer. Furthermore, we show empirically that this same strategy also shows improvement for generalisation to similar but “unreachable” tasks which could be due to improved generalisation of the learned latent representations.

Max Weltevrede
