Extrapolation to unseen domains: from theory to applications

Monday, April 22nd, 2024, 8:00 PT / 11:00 ET / 17:00 CET

3rd joint webinar of the IMS New Researchers Group, Young Data Science Researcher Seminar Zürich and the YoungStatS Project.

When & Where:

Monday, April 22nd, 2024, 8:00 PT / 11:00 ET / 17:00 CET
Online, via Zoom. The registration form is available here.

Speakers:

Max Simchowitz, Robot Locomotion Group, MIT: “Statistical Learning under Heterogeneous Distribution Shift”

Abstract: What makes a trained predictor, e.g., a neural network, more or less susceptible to performance degradation under distribution shift? Spurious correlation, lack of diversity in the training data, and brittleness of the trained model are all possible culprits. In this talk, we will investigate a less well-studied factor: the statistical complexity of the individual features themselves. We will show that, for a very general class of predictors with a certain additive structure, empirical risk minimization (ERM) is less sensitive to distribution shifts in “simple” features than in “complex” ones, where simplicity and complexity are measured in terms of natural statistical quantities. We demonstrate that this arises because standard ERM learns the dependence on the “simpler” feature more quickly, whilst avoiding the risk of overfitting to more “complex” features. We will conclude by drawing connections to the orthogonal machine learning literature and by validating our theory on various experimental domains (even those in which the additivity assumption fails to hold).

Mohammad Lotfollahi, Wellcome Sanger Institute, University of Cambridge: “Generative machine learning to model cellular perturbations”

Abstract: The field of cellular biology has long sought to understand the intricate mechanisms that govern cellular responses to various perturbations, be they chemical, physical, or biological. Traditional experimental approaches, while invaluable, often face limitations in scalability and throughput, especially when exploring the vast combinatorial space of potential cellular states. Enter generative machine learning, which has shown exceptional promise in modeling complex biological systems. This talk will highlight recent successes, address the challenges and limitations of current models, and discuss the future direction of this exciting interdisciplinary field. Through examples of practical applications, we will illustrate the transformative potential of generative ML in advancing our understanding of cellular perturbations and in shaping the future of biomedical research.

Zhijing Jin, Max Planck Institute and ETH Zürich: “A Paradigm Shift in Addressing Distribution Shifts: Insights from Large Language Models”

Abstract: Traditionally, the challenge of distribution shifts, where the training data distribution differs from the test data distribution, has been a central concern in statistical learning and model generalization. Traditional methods have primarily focused on techniques such as domain adaptation and transfer learning. However, the rise of large language models (LLMs) such as ChatGPT has ushered in a novel empirical success, triggering a significant “shift” in problem formulation and approach for traditional distribution shift problems. In this talk, I will start with two formulations for LLMs: (1) the engineering heuristics aimed at transforming “out-of-distribution” (OOD) problems into “in-distribution” scenarios, which is further accompanied by (2) the hypothesized “emergence of intelligence” through massive scaling of data and model parameters, which challenges our traditional views on distribution shifts. I will sequentially examine these aspects, first by presenting behavioral tests of these models’ generalization capabilities across unseen data, and then by conducting intrinsic checks to uncover the mechanisms LLMs have learned. This talk seeks to provoke thoughts on several questions: Do the strategies of “making OOD problem IID” and facilitating the “emergence of intelligence” by scaling truly stand up to scientific scrutiny? Furthermore, what do these developments imply for the field of statistical learning and the broader evolution of AI?

Discussant: Nicolai Meinshausen, ETH Zürich

The YoungStatS project of the Young Statisticians Europe initiative (FENStatS) is supported by the Bernoulli Society for Mathematical Statistics and Probability and the Institute of Mathematical Statistics (IMS).

If you missed this webinar, you can watch the recording on our YouTube channel.