Does a program or intervention result in improved outcomes? This is the fundamental question in the generation of evidence. Randomized experiments provide the simplest and most rigorous means to answer this question, yet real-world constraints can limit their practicality and complicate their implementation. The STEPP Center endeavors to facilitate research designed to establish causal relations by addressing challenges to the process of scientific inference.

Our work has focused on approaches to designing and analyzing experiments, techniques for analyzing data from quasi-experiments and observational studies, and methods for generalizing experimental results to policy-relevant populations.

To make these methods more accessible to researchers, the Center provides tutorial papers, online tools, and resources including working papers; seminars and short courses that train students and practitioners; and professional development institutes for established researchers.

Generating Evidence Research

Designing Experiments


Working Papers

Hedges, L. V. & Schauer, J. (2019). The design of replication studies. Evanston, IL: Northwestern University Institute for Policy Research Working Paper.

Katsanes, R. (2017). Design and analysis of trials for developing adaptive interventions in education. Evanston, IL: Northwestern University Institute for Policy Research Working Paper.



Bilimoria, K. Y., Chung, J. W., Hedges, L. V., Dahlke, A. R., Love, R., Cohen, M. E., Tarpley, J., Mellinger, J., Mahvi, D. M., Kelz, R. R., Ko, C. Y., Hoyt, D. B., & Lewis, F. H. (2016). Development of the flexibility in duty hour requirements for surgical trainees (FIRST) trial protocol: A national cluster-randomized trial of resident duty hour policies. JAMA Surgery, 151, 273-281. DOI: 10.1001/jamasurg.2015.4990.

Hedberg, E. C. & Hedges, L. V. (2014). Reference values of within-district intraclass correlations of academic achievement by district characteristics: Results from a meta-analysis of district-specific data. Evaluation Review, 38, 546-582. DOI: 10.1177/0193841X14554212.

Hedges, L. V. & Borenstein, M. (2014). Constrained optimal design in three and four level experiments. Journal of Educational and Behavioral Statistics, 39, 257-281. DOI: 10.3102/1076998614534897.

Spybrook, J., Hedges, L. V., & Borenstein, M. (2014). Understanding statistical power in cluster randomized trials: Challenges posed by differences in notation and terminology. Journal of Research on Educational Effectiveness, 7, 384-406. DOI: 10.1080/19345747.2013.848963.

Hedges, L. V. & Hedberg, E. C. (2013). Intraclass correlations and covariate outcome correlations for planning two- and three-level cluster-randomized experiments in education. Evaluation Review, 37, 13-57. DOI: 10.1177/0193841X14529126.

Hedges, L. V., Hedberg, E. C., & Kuyper, A. (2012). The variance of intraclass correlations in three and four level models. Educational and Psychological Measurement, 72, 893-909. DOI: 10.1177/0013164412445193.
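Several of the papers above concern the intraclass correlation (ICC), a key input when planning cluster-randomized trials: the larger the ICC, the less independent information each additional student within a cluster contributes. A minimal sketch of this idea using the standard Kish design effect, with invented planning values (the numbers are not taken from any of the papers above):

```python
def design_effect(icc, cluster_size):
    """Kish design effect 1 + (m - 1) * rho for a cluster-randomized
    design with common cluster size m and intraclass correlation rho."""
    return 1.0 + (cluster_size - 1) * icc

def effective_sample_size(n_total, icc, cluster_size):
    """Number of independent observations the clustered sample is worth."""
    return n_total / design_effect(icc, cluster_size)

# Hypothetical planning values: 40 schools of 25 students, ICC = 0.20.
deff = design_effect(icc=0.20, cluster_size=25)
print(deff)                                   # about 5.8
print(effective_sample_size(1000, 0.20, 25))  # about 172 "independent" students
```

The point of the sketch is that 1,000 clustered students can carry far less information than 1,000 independently sampled ones, which is why the reference ICC values compiled in the papers above matter for power analysis.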

Analyzing Experiments


Pustejovsky, J. & Tipton, E. (2018). Small sample methods for cluster-robust variance estimation and hypothesis testing in fixed effects models. Journal of Business and Economic Statistics, 36(4), 672-683. DOI: 10.1080/07350015.2016.1247004.

Hedges, L. V. & Olkin, I. (2016). Overlap between treatment and control group distributions of an experiment as an effect size measure. Psychological Methods, 21, 61-68. DOI: 10.1037/met0000042

Hedges, L. V. & Citkowicz, M. (2015). Estimating effect size when there is clustering in one treatment group. Behavior Research Methods, 47, 1295-1308. DOI: 10.3758/s13428-014-0538-z.



Analyzing Quasi-Experiments and Observational Studies

Coming soon.

Generalizing Experimental Results


Working Papers

Tipton, E. Sample selection in randomized trials with multiple target populations. Working paper.

Tipton, E. Beyond the ATE: Designing randomized trials to understand treatment effect heterogeneity. Working paper.



Tipton, E., Yeager, D., Schneider, B., & Iachan, R. Designing probability samples to identify sources of treatment effect heterogeneity. In P. J. Lavrakas (Ed.), Experimental methods in survey research: Techniques that combine random sampling with random assignment. New York, NY: Wiley.

Chung, J. W., Hedges, L. V., Bilimoria, K. Y. et al. (2018). The estimation of population average treatment effects in the FIRST trial: Application of a propensity score-based stratification approach. Health Services Research, 2567–2590. DOI: 10.1111/1475-6773.12752.

Tipton, E., & Hedges, L. V. (2017). The role of the sample in estimating and explaining treatment effect heterogeneity. Journal of Research on Educational Effectiveness, 10, 903-909. DOI: 10.1080/19345747.2017.1364563.

Tipton, E. & Peck, L. (2017). A design-based approach to improve external validity in welfare policy evaluations. Evaluation Review (Special Issue: External Validity 1), 41(4), 326-356. DOI: 10.1177/0193841X16655656.

Levay, K. E., Freese, J., & Druckman, J. N. (2016). The demographic and political composition of Mechanical Turk samples. Sage Open, 6(1), 2158244016636433. DOI: 10.1177/2158244016636433.

Tipton, E., Hedges, L. V., Hallberg, K., & Chan, W. (2016). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 40, 1-34. DOI: 10.1177/0193841X16655665.

Mullinix, K. J., Leeper, T. J., Druckman, J. N., & Freese, J. (2015). The generalizability of survey experiments. Journal of Experimental Political Science, 2(2), 109-138. DOI: 10.1017/XPS.2015.19.

O’Muircheartaigh, C. & Hedges, L. V. (2014). Generalizing from experiments with non-representative samples. Journal of the Royal Statistical Society, Series C, 63, 195-210. DOI: 10.1111/rssc.12037.

Tipton, E. (2014). How generalizable is your experiment? Comparing a sample and population through a generalizability index. Journal of Educational and Behavioral Statistics, 39(6), 478-501. DOI: 10.3102/1076998614558486.

Tipton, E. (2013). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38, 239-266. DOI: 10.3102/1076998612441947.
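A common thread in the generalization papers above is reweighting subgroup results so that an experimental sample better resembles a policy-relevant population. A minimal post-stratification sketch with invented numbers, where the strata stand in for propensity-score subclasses; this illustrates the general idea only, not the specific estimators developed in those papers:

```python
def reweighted_effect(stratum_effects, stratum_shares):
    """Average stratum-specific effect estimates using the given
    stratum shares (which must sum to one)."""
    assert abs(sum(stratum_shares) - 1.0) < 1e-9
    return sum(e * w for e, w in zip(stratum_effects, stratum_shares))

# Invented numbers: three subclasses; the sample over-represents stratum 0.
effects = [0.10, 0.30, 0.50]     # estimated effect within each subclass
sample_shares = [0.5, 0.3, 0.2]  # subclass shares in the experimental sample
pop_shares = [0.2, 0.3, 0.5]     # subclass shares in the target population

print(reweighted_effect(effects, sample_shares))  # sample average, about 0.24
print(reweighted_effect(effects, pop_shares))     # population-reweighted, about 0.36
```

Because the effect varies across strata and the sample composition differs from the population's, the sample-average effect and the population-reweighted effect diverge; that gap is what these generalization methods are designed to correct.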