About me

I’m a PhD researcher in the Sequential Decision Making group at the Delft University of Technology, supervised by Matthijs Spaan and Wendelin Böhmer. I do research in reinforcement learning with a focus on developing RL agents that can generalise to new scenarios. I have investigated several ways of improving generalisation performance: exploring more of the training environments, using ensembles and distillation after training, and applying data augmentation in an offline RL setting. I am currently doing an Applied Science internship at Wayve in London, where I, among other things, investigate data augmentation techniques to improve real-world autonomous driving performance.

Generally, I am interested in many things. At the moment this includes generalisation, adaptation, continual learning, causality, physics, the scientific method, software engineering, playing guitar, singing, painting and collecting fossils.

News

May 06, 2026	Our paper “Training on Irrelevant States Implies Data Augmentation: Generalization in Contextual MDPs” got accepted at RLC 2026!
Jan 01, 2026	Started an internship at Wayve in London.
Sep 18, 2025	Our paper “How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning” got accepted at NeurIPS 2025!
Jul 10, 2025	Presented our work “Exploration Implies Data Augmentation” in Cathy Wu’s lab at MIT.

Publications

- Generalization in Offline RL: The Structure of Pessimism Matters More than How Pessimistic You Are
  
  Max Weltevrede, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Preprint, May 2026
  
  Offline reinforcement learning (RL) suffers from training instability due to overestimation bias and distribution shifts, and pessimism (and other forms of conservatism) has been largely adopted to overcome this. However, being overly pessimistic has been associated with hindering generalization to out-of-distribution actions. In this paper, we demonstrate that being overly pessimistic does not inherently prevent optimality when generalizing to new contexts in the contextual Markov decision process (CMDP) framework. Instead, we argue that generalization is fundamentally about learning the structures underlying the invariances of the optimal solution, often formalized as symmetries. Therefore, it is not about how pessimistic the agent is, but whether the pessimism violates these symmetries. We prove that a mildly pessimistic, but non-symmetric value function can generalize worse than an overly pessimistic, but symmetric one. In offline RL, the structure of the pessimism is determined by the structure of the offline dataset. As such, enforcing a symmetric value function might require techniques such as data augmentation (DA), motivating its importance for offline RL in particular. Inspired by our theoretical results, we argue that DA can best be applied through a consistency loss during policy extraction, rather than the common practice of standard offline training on an augmented dataset. This is empirically demonstrated on a rotationally symmetric reacher environment for two popular offline RL approaches, IQL and CQL.
- Training on Irrelevant States Implies Data Augmentation: Generalization in Contextual MDPs
  
  Max Weltevrede, Caroline Horsch, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Reinforcement Learning Conference (RLC 2026), May 2026
  
  In the zero-shot policy transfer (ZSPT) setting for contextual Markov decision processes (CMDP), agents train on a fixed, finite set of contexts and must generalize to new ones. Recent work has demonstrated that training on additional states, even if they are irrelevant for solving the current context, can improve generalization to unseen contexts. In this paper, we demonstrate that training on these states can indeed improve generalization, but can come at a cost of reducing the accuracy of the learned value function, which should hurt generalization. We hypothesize and demonstrate that increasing the agent’s coverage by training on these additional states while also increasing the accuracy improves generalization even further. Inspired by this, we propose a simple approach Explore-Go that leverages existing pure exploration strategies in a new way: by introducing a pure exploration phase at the start of each training episode. Unlike previous approaches that apply exploration strategies for the purpose of improving generalization, our approach can be combined with both on- and off-policy algorithms. We demonstrate the effectiveness of Explore-Go when combined with several popular algorithms and show an increase in test-time performance across several generalization benchmarks, even partially observable ones. With this, we hope to provide practitioners with a simple modification that can significantly improve the generalization of their agents.
- Sparse Masked Attention Policies for Reliable Generalization
  
  Caroline Horsch, Laurens Engwegen, Max Weltevrede, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Preprint, Under Review, Feb 2026
  
  In reinforcement learning, abstraction methods that remove unnecessary information from the observation are commonly used to learn policies which generalize better to unseen tasks. However, these methods often overlook a crucial weakness: the function which extracts the reduced-information representation has unknown generalization ability in unseen observations. In this paper, we address this problem by presenting an information removal method which more reliably generalizes to new states. We accomplish this by using a learned masking function which operates on, and is integrated with, the attention weights within an attention-based policy network. We demonstrate that our method significantly improves policy generalization to unseen tasks in the Procgen benchmark compared to standard PPO and masking approaches.
- Universal Value-Function Uncertainties
  
  Moritz A. Zanger, Max Weltevrede, Yaniv Oren, Pascal R. Van der Vaart, Caroline Horsch, Wendelin Böhmer, and Matthijs T. J. Spaan
  
  International Conference on Learning Representations (ICLR 2026), Feb 2026
  
  Estimating epistemic uncertainty in value functions is a crucial challenge for many aspects of reinforcement learning (RL), including efficient exploration, safe decision-making, and offline RL. While deep ensembles provide a robust method for quantifying value uncertainty, they come with significant computational overhead. Single-model methods, while computationally favorable, often rely on heuristics and typically require additional propagation mechanisms for myopic uncertainty estimates. In this work we introduce universal value-function uncertainties (UVU), which, similar in spirit to random network distillation (RND), quantify uncertainty as squared prediction errors between an online learner and a fixed, randomly initialized target network. Unlike RND, UVU errors reflect policy-conditional value uncertainty, incorporating the future uncertainties any given policy may encounter. This is due to the training procedure employed in UVU: the online network is trained using temporal difference learning with a synthetic reward derived from the fixed, randomly initialized target network. We provide an extensive theoretical analysis of our approach using neural tangent kernel (NTK) theory and show that in the limit of infinite network width, UVU errors are exactly equivalent to the variance of an ensemble of independent universal value functions. Empirically, we show that UVU achieves equal performance to large ensembles on challenging multi-task offline RL settings, while offering simplicity and substantial computational savings.
- How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning
  
  Max Weltevrede, Moritz A. Zanger, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Conference on Neural Information Processing Systems (NeurIPS 2025), May 2025
  
  In the zero-shot policy transfer setting in reinforcement learning, the goal is to train an agent on a fixed set of training environments so that it can generalise to similar, but unseen, testing environments. Previous work has shown that policy distillation after training can sometimes produce a policy that outperforms the original in the testing environments. However, it is not yet entirely clear why that is, or what data should be used to distil the policy. In this paper, we prove, under certain assumptions, a generalisation bound for policy distillation after training. The theory provides two practical insights: for improved generalisation, you should 1) train an ensemble of distilled policies, and 2) distil it on as much data from the training environments as possible. We empirically verify that these insights hold in more general settings, when the assumptions required for the theory no longer hold. Finally, we demonstrate that an ensemble of policies distilled on a diverse dataset can generalise significantly better than the original agent.
- Explore-Go: Leveraging Exploration for Generalisation in Deep Reinforcement Learning
  
  Max Weltevrede, Felix Kaubek, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Seventeenth European Workshop on Reinforcement Learning (EWRL), Sep 2024
  
  One of the remaining challenges in reinforcement learning is to develop agents that can generalise to novel scenarios they might encounter once deployed. This challenge is often framed in a multi-task setting where agents train on a fixed set of tasks and have to generalise to new tasks. Recent work has shown that in this setting increased exploration during training can be leveraged to increase the generalisation performance of the agent. This makes sense when the states encountered during testing can actually be explored during training. In this paper, we provide intuition why exploration can also benefit generalisation to states that cannot be explicitly encountered during training. Additionally, we propose a novel method Explore-Go that exploits this intuition by increasing the number of states on which the agent trains. Explore-Go effectively increases the starting state distribution of the agent and as a result can be used in conjunction with most existing on-policy or off-policy reinforcement learning algorithms. We show empirically that our method can increase generalisation performance in an illustrative environment and on the Procgen benchmark.
- The Role of Diverse Replay for Generalisation in Reinforcement Learning
  
  Max Weltevrede, Matthijs T. J. Spaan, and Wendelin Böhmer
  
  Sixteenth European Workshop on Reinforcement Learning (EWRL), Aug 2023
  
  In reinforcement learning (RL), key components of many algorithms are the exploration strategy and replay buffer. These strategies regulate what environment data is collected and trained on and have been extensively studied in the RL literature. In this paper, we investigate the impact of these components in the context of generalisation in multi-task RL. We investigate the hypothesis that collecting and training on more diverse data from the training environments will improve zero-shot generalisation to new tasks. We motivate mathematically and show empirically that generalisation to tasks that are “reachable” during training is improved by increasing the diversity of transitions in the replay buffer. Furthermore, we show empirically that this same strategy also shows improvement for generalisation to similar but “unreachable” tasks, which could be due to improved generalisation of the learned latent representations.
