Research – Yair Goldberg’s Lab

Fighting COVID-19 by learning from data

To help policymakers base decisions on scientific evidence, we use mathematical modeling and advanced statistical tools to study different aspects of the COVID-19 pandemic. Our research includes estimating the susceptibility and infectivity of children and adolescents; the protection that vaccination and previous SARS-CoV-2 infection provide against subsequent infection and other COVID-19 outcomes; and the effect of COVID-19 on different aspects of public health, such as suicide rates and miscarriage.

Measuring uncertainty of machine learning predictions

Data scientists often want to know how confident they can be in a prediction, and whether a given feature has a significant influence on the response variable. Drawing statistical inference for machine learning algorithms is difficult. We study methods for performing statistical inference for two common machine learning techniques: kernel machines and deep learning. We use Bayesian methods to quantify uncertainty, select hyper-parameter values, and bound the generalization error. We propose novel PAC-Bayes generalization bounds, which can be data-dependent.
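For orientation, here is what a PAC-Bayes bound looks like in its classical form. This is the standard textbook statement, given only as background, and not one of the bounds proposed by the lab. The symbols are assumptions introduced for this illustration: P is a prior over hypotheses fixed before seeing the data, Q is any posterior, L(h) is the expected loss of hypothesis h, \hat{L}_n(h) is its average loss on an i.i.d. sample of size n ≥ 8, and the loss takes values in [0, 1]. With probability at least 1 − δ over the sample, simultaneously for all Q:

```latex
\[
\mathbb{E}_{h \sim Q}\bigl[L(h)\bigr]
\;\le\;
\mathbb{E}_{h \sim Q}\bigl[\hat{L}_n(h)\bigr]
+ \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
\]
```

The bound trades off empirical fit against the divergence of the posterior from the prior, and it tightens as n grows.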

Machine learning tools for missing and censored data

When using machine learning algorithms, it is often assumed that the data are complete. In real-life applications, this assumption is usually over-optimistic. Missingness can arise in many ways: some covariates are missing, some responses are missing, only a lower bound is given for the response (i.e., the response is right-censored), observations are seen only if they crossed some level (i.e., left truncation), or a label is given only to a bag of observations. We develop machine learning tools that can handle missing data, using imputation, inverse probability weighting, and doubly-robust estimators. See, for example, the following works: kernel machines with right-censored data, current status data, and missing responses; and finite-sample bounds for Cox regression and general right-censored data.
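As a concrete illustration of inverse probability weighting for right-censored responses, the following is a minimal sketch and not the lab's implementation. It assumes NumPy is available, that observed times have no ties, and that censoring is independent of the response. The function names `ipcw_weights` and `ipcw_mean` are hypothetical, introduced only for this example. Each uncensored observation is weighted by the inverse of the estimated probability of remaining uncensored up to its time, where that probability comes from a Kaplan-Meier estimate of the censoring distribution.

```python
import numpy as np


def ipcw_weights(times, events):
    """Inverse-probability-of-censoring weights (hypothetical helper).

    times  : observed times min(T, C), one per subject
    events : 1/True if the response T was observed, 0/False if censored
    Assumes no tied times (continuous data).
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    n = len(times)

    order = np.argsort(times, kind="stable")
    censored = ~events[order]

    # Kaplan-Meier estimate of G(t) = P(C > t); here censoring is the "event".
    at_risk = n - np.arange(n)                 # subjects with time >= current time
    G = np.cumprod(1.0 - censored / at_risk)   # G at each sorted time
    G_left = np.concatenate(([1.0], G[:-1]))   # left limit G(t-)

    # Uncensored subjects get weight 1 / G(t-); censored subjects get weight 0.
    w_sorted = np.where(events[order], 1.0 / G_left, 0.0)

    weights = np.empty(n)
    weights[order] = w_sorted                  # back to the original order
    return weights


def ipcw_mean(times, events):
    """IPCW estimate of the mean response E[T] (hypothetical helper)."""
    w = ipcw_weights(times, events)
    return float(np.mean(w * np.asarray(times, dtype=float)))


# Simulated example: exponential responses with independent exponential censoring.
rng = np.random.default_rng(0)
t = rng.exponential(1.0, size=5000)    # true responses, mean 1
c = rng.exponential(2.0, size=5000)    # censoring times
observed = np.minimum(t, c)
delta = t <= c

print("IPCW estimate of the mean:", ipcw_mean(observed, delta))
print("Naive mean of uncensored responses:", observed[delta].mean())
```

Averaging only the uncensored responses tends to underestimate the mean, because long responses are more likely to be censored; the weights compensate for this. The same weights can be passed as per-sample weights to a learning algorithm's loss.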