Dallas Card

Assistant Professor, School of Information
University of Michigan

Email: dalc@umich.edu
GitHub, Twitter, Blog
Google Scholar, ORCiD

I am an assistant professor in the School of Information at the University of Michigan. Before that, I was a postdoctoral researcher in the Stanford NLP Group and the Stanford Data Science Institute. I received my Ph.D. from the Machine Learning Department at Carnegie Mellon University, where I was advised by Noah Smith.

My research centers on making machine learning more reliable and responsible, and on using machine learning and natural language processing to learn about society from text.


  • July 2023: I will be speaking about evaluation challenges at the MIDAS workshop on Generative AI for Research, July 25-26th
  • July 2023: I will be attending the CASMI workshop on Sociotechnical Approaches to Measurement and Validation for Safety in AI, July 18-19th
  • July 2023: I will be attending ACL 2023 in Toronto, July 9-14th, where I will be presenting a paper on Semantic Change Detection
  • June 2023: I will be attending FAccT 2023 in Chicago, June 12-15th
  • May 2023: I will be giving a keynote on May 16th at the MIDAS Forum on Building Ethical and Trustworthy AI
  • May 2023: I will be speaking on May 12th about ChatGPT at the Ann Arbor District Library with Rada Mihalcea!
  • May 2023: I will be presenting at the Cambridge Language Technology Lab seminar on May 4th.
  • March 2023: Congratulations to my PhD student Lavinia Dunagan on being awarded an NSF GRFP!!

Selected Publications

Correlation between human ratings and Scaled JSD for SemEval English

Substitution-based Semantic Change Detection using Contextual Embeddings
Dallas Card
In Proceedings of ACL, 2023.
Abstract Paper Code BibTeX

Measuring semantic change has thus far remained a task where methods using contextual embeddings have struggled to improve upon simpler techniques relying only on static word vectors. Moreover, many of the previously proposed approaches suffer from downsides related to scalability and ease of interpretation. We present a simplified approach to measuring semantic change using contextual embeddings, relying only on the most probable substitutes for masked terms. Not only is this approach directly interpretable, it is also far more efficient in terms of storage, achieves superior average performance across the most frequently cited datasets for this task, and allows for more nuanced investigation of change than is possible with static word vectors.

Correlation between GPT-3 quality scores and demographic factors

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan, Dallas Card, Sarah K. Dreier, Emily K. Gade, Leroy Z. Wang, Zeyu Wang, Luke Zettlemoyer, Noah A. Smith
In Proceedings of EMNLP, 2022.
Abstract Paper Data and Code BibTeX

Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles -- written by students from across the country -- we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban ZIP codes are more likely to be classified as high quality. We then demonstrate that the filter's measurement of quality is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.

Net tone of immigration speeches over time

Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration
Dallas Card, Serina Chang, Chris Becker, Julia Mendelsohn, Rob Voigt, Leah Boustan, Ran Abramitzky, Dan Jurafsky
In Proceedings of the National Academy of Sciences 119(31), 2022.
Abstract Paper Data and Code BibTeX

We classify and analyze 200,000 US congressional speeches and 5,000 presidential communications related to immigration from 1880 to the present. Despite the salience of antiimmigration rhetoric today, we find that political speech about immigration is now much more positive on average than in the past, with the shift largely taking place between World War II and the passage of the Immigration and Nationality Act in 1965. However, since the late 1970s, political parties have become increasingly polarized in their expressed attitudes toward immigration, such that Republican speeches today are as negative as the average congressional speech was in the 1920s, an era of strict immigration quotas. Using an approach based on contextual embeddings of text, we find that modern Republicans are significantly more likely to use language that is suggestive of metaphors long associated with immigration, such as "animals" and "cargo," and make greater use of frames like "crime" and "legality." The tone of speeches also differs strongly based on which nationalities are mentioned, with a striking similarity between how Mexican immigrants are framed today and how Chinese immigrants were framed during the era of Chinese exclusion in the late 19th century. Overall, despite more favorable attitudes toward immigrants and the formal elimination of race-based restrictions, nationality is still a major factor in how immigrants are spoken of in Congress.

The Values Encoded in Machine Learning Research [Distinguished Paper Award]
Abeba Birhane, Pratyusha Kalluri, Dallas Card, William Agnew, Ravit Dotan, and Michelle Bao
In Proceedings of FAccT, 2022.
Abstract Paper Data and Code BibTeX

Machine learning currently exerts an outsized influence on the world, increasingly affecting institutional practices and impacted communities. It is therefore critical that we question vague conceptions of the field as value-neutral or universally beneficial, and investigate what specific values the field is advancing. In this paper, we first introduce a method and annotation scheme for studying the values encoded in documents such as research papers. Applying the scheme, we analyze 100 highly cited machine learning papers published at premier machine learning conferences, ICML and NeurIPS. We annotate key features of papers which reveal their values: their justification for their choice of project, which attributes of their project they uplift, their consideration of potential negative consequences, and their institutional affiliations and funding sources. We find that few of the papers justify how their project connects to a societal need (15%) and far fewer discuss negative potential (1%). Through line-by-line content analysis, we identify 59 values that are uplifted in ML research, and, of these, we find that the papers most frequently justify and assess themselves based on Performance, Generalization, Quantitative evidence, Efficiency, Building on past work, and Novelty. We present extensive textual evidence and identify key themes in the definitions and operationalization of these values. Notably, we find systematic textual evidence that these top values are being defined and applied with assumptions and implications generally supporting the centralization of power. Finally, we find increasingly close ties between these highly cited papers and tech companies and elite universities.

Modular Domain Adaptation
Junshen Chen, Dallas Card, Dan Jurafsky
In Findings of ACL, 2022.
Abstract Paper Code Blog Post BibTeX

Off-the-shelf models are widely used by computational social science researchers to measure properties of text, such as sentiment. However, without access to source data it is difficult to account for domain shift, which represents a threat to validity. Here, we treat domain adaptation as a modular process that involves separate model producers and model consumers, and show how they can independently cooperate to facilitate more accurate measurements of text. We introduce two lightweight techniques for this scenario, and demonstrate that they reliably increase out-of-domain accuracy on four multi-domain text classification datasets when used with linear and contextual embedding models. We conclude with recommendations for model producers and consumers, and release models and replication code to accompany this paper.

Problems with Cosine

Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
Kaitlyn Zhou, Kawin Ethayarajh, Dallas Card, Dan Jurafsky
In Proceedings of ACL, 2022.
Abstract Paper BibTeX

Cosine similarity of contextual embeddings is used in many NLP tasks (e.g., QA, IR, MT) and metrics (e.g., BERTScore). Here, we uncover systematic ways in which word similarities estimated by cosine over BERT embeddings are understated and trace this effect to training data frequency. We find that relative to human judgements, cosine similarity underestimates the similarity of frequent words with other instances of the same word or other words across contexts, even after controlling for polysemy and other factors. We conjecture that this underestimation of similarity for high frequency words is due to differences in the representational geometry of high and low frequency words and provide a formal argument for the two-dimensional case.

On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, et al.
arXiv:2108.07258, 2021.
Abstract Paper BibTeX

AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

Expected Validation Performance and Estimation of a Random Variable's Maximum
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, Noah A. Smith
In Findings of EMNLP, 2021.
Abstract Paper BibTeX

Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate three estimators and find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare between different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE.

Causal Effects of Linguistic Properties
Reid Pryzant, Dallas Card, Dan Jurafsky, Victor Veitch, and Dhanya Sridhar
In Proceedings of NAACL, 2021.
Abstract Paper BibTeX

We consider the problem of using observational data to estimate the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? This paper addresses two technical challenges related to the problem before developing a practical method. First, we formalize the causal quantity of interest as the effect of a writer's intent, and establish the assumptions necessary to identify this from observational data. Second, in practice, we only have access to noisy proxies for the linguistic properties of interest -- e.g., predictions from classifiers and lexicons. We propose an estimator for this setting and prove that its bias is bounded when we perform an adjustment for the text. Based on these results, we introduce TextCause, an algorithm for estimating causal effects of linguistic properties. The method leverages (1) distant supervision to improve the quality of noisy proxies, and (2) a pre-trained language model (BERT) to adjust for the text. We show that the proposed method outperforms related approaches when estimating the effect of Amazon review sentiment on semi-simulated sales figures. Finally, we present an applied case study investigating the effects of complaint politeness on bureaucratic response times.

With Little Power Comes Great Responsibility
Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky
In Proceedings of EMNLP, 2020.
Abstract Paper Code BibTeX

Despite its importance to experimental design, statistical power (the probability that, given a real effect, an experiment will reject the null hypothesis) has largely been ignored by the NLP community. Underpowered experiments make it more difficult to discern the difference between statistical noise and meaningful model improvements, and increase the chances of exaggerated findings. By meta-analyzing a set of existing NLP papers and datasets, we characterize typical power for a variety of settings and conclude that underpowered experiments are common in the NLP literature. In particular, for several tasks in the popular GLUE benchmark, small test sets mean that most attempted comparisons to state of the art models will not be adequately powered. Similarly, based on reasonable assumptions, we find that the most typical experimental design for human rating studies will be underpowered to detect small model differences, of the sort that are frequently studied. For machine translation, we find that typical test sets of 2000 sentences have approximately 75% power to detect differences of 1 BLEU point. To improve the situation going forward, we give an overview of best practices for power analysis in NLP and release a series of notebooks to assist with future power analyses.
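To make the notion of power concrete, here is a minimal simulation sketch. It is not the released notebooks: it assumes two models evaluated on independent test sets of equal size and compared with an unpaired two-proportion z-test, whereas real comparisons are often paired on a shared test set. The accuracies, test-set sizes, and function names are hypothetical.

```python
import math
import random


def two_proportion_p_value(correct_a, correct_b, n):
    """Two-sided p-value of a pooled two-proportion z-test (equal sample sizes)."""
    pooled = (correct_a + correct_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no variation at all, so nothing to reject
    z = (correct_a / n - correct_b / n) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_power(acc_a, acc_b, n, alpha=0.05, trials=300, seed=0):
    """Fraction of simulated experiments in which the null hypothesis is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Draw the number of correct predictions for each model.
        correct_a = sum(rng.random() < acc_a for _ in range(n))
        correct_b = sum(rng.random() < acc_b for _ in range(n))
        if two_proportion_p_value(correct_a, correct_b, n) < alpha:
            rejections += 1
    return rejections / trials


# Hypothetical setting: a true 3-point accuracy gap, small vs. large test set.
power_small = simulate_power(0.80, 0.83, n=200)
power_large = simulate_power(0.80, 0.83, n=2000)
print(f"power with n=200:  {power_small:.2f}")
print(f"power with n=2000: {power_large:.2f}")
```

Under these assumptions, the same true improvement is detected far less often on the small test set, which is the sense in which an experiment is underpowered.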

Detecting Stance in Media On Global Warming
Yiwei Luo, Dallas Card, and Dan Jurafsky
In Findings of EMNLP, 2020.
Abstract Paper Code BibTeX

Citing opinions is a powerful yet understudied strategy in argumentation. For example, an environmental activist might say, "Leading scientists agree that global warming is a serious concern," framing a clause which affirms their own stance ("that global warming is serious") as an opinion endorsed ("[scientists] agree") by a reputable source ("leading"). In contrast, a global warming denier might frame the same clause as the opinion of an untrustworthy source with a predicate connoting doubt: "Mistaken scientists claim [...]." Our work studies opinion-framing in the global warming (GW) debate, an increasingly partisan issue that has received little attention in NLP. We introduce Global Warming Stance Dataset (GWSD), a dataset of stance-labeled GW sentences, and train a BERT classifier to study novel aspects of argumentation in how different sides of a debate represent their own and each other's opinions. From 56K news articles, we find that similar linguistic devices for self-affirming and opponent-doubting discourse are used across GW-accepting and skeptic media, though GW-skeptical media shows more opponent-doubt. We also find that authors often characterize sources as hypocritical, by ascribing opinions expressing the author's own view to source entities known to publicly endorse the opposing view. We release our stance dataset, model, and lexicons of framing devices for future work on opinion-framing and the automatic detection of GW stance.

Explain like I am a Scientist: The Linguistic Barriers of Entry to r/science
Tal August, Dallas Card, Gary Hsieh, Noah A. Smith, and Katharina Reinecke
In Human Factors in Computing Systems (CHI), 2020.
Abstract Paper BibTeX

As an online community for discussing research findings, r/science has the potential to contribute to science outreach and communication with a broad audience. Yet previous work suggests that most of the active contributors on r/science are science-educated people rather than a lay general public. One potential reason is that r/science contributors might use a different, more specialized language than used in other subreddits. To investigate this possibility, we analyzed the language used in more than 68 million posts and comments from 12 subreddits from 2018. We show that r/science uses a specialized language that is distinct from other subreddits. Transient (newer) authors of posts and comments on r/science use less specialized language than more frequent authors, and those that leave the community use less specialized language than those that stay, even when comparing their first comments. These findings suggest that the specialized language used in r/science has a gatekeeping effect, preventing participation by people whose language does not align with that used in r/science. By characterizing r/science's specialized language, we contribute guidelines and tools for increasing the number of contributors in r/science.

On Consequentialism and Fairness
Dallas Card and Noah A. Smith
Frontiers in Artificial Intelligence, 2020.
Abstract Paper BibTeX

Recent work on fairness in machine learning has primarily emphasized how to define, quantify, and encourage “fair” outcomes. Less attention has been paid, however, to the ethical foundations which underlie such efforts. Among the ethical perspectives that should be taken into consideration is consequentialism, the position that, roughly speaking, outcomes are all that matter. Although consequentialism is not free from difficulties, and although it does not necessarily provide a tractable way of choosing actions (because of the combined problems of uncertainty, subjectivity, and aggregation), it nevertheless provides a powerful foundation from which to critique the existing literature on machine learning fairness. Moreover, it brings to the fore some of the tradeoffs involved, including the problem of who counts, the pros and cons of using a policy, and the relative value of the distant future. In this paper we provide a consequentialist critique of common definitions of fairness within machine learning, as well as a machine learning perspective on consequentialism. We conclude with a broader discussion of the issues of learning and randomization, which have important implications for the ethics of automated decision making systems.

Show Your Work Figure

Show Your Work: Improved Reporting of Experimental Results
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith
In Proceedings of EMNLP, 2019.
Abstract Paper Code Press BibTeX

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields massive variation across papers, from hours to weeks. We conclude with a set of best practices for reporting experimental results which allow for robust future comparisons, and provide code to allow researchers to use our technique.
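As a rough illustration of expected validation performance as a function of budget, the sketch below computes the expected maximum of k scores drawn with replacement from the empirical distribution of observed validation scores. It is a simplified sketch rather than the released code, and the scores and function name are hypothetical.

```python
def expected_max_performance(scores, budget):
    """Expected best score among `budget` trials drawn i.i.d. (with
    replacement) from the empirical distribution of `scores`."""
    if not scores or budget < 1:
        raise ValueError("need at least one score and a budget of at least 1")
    ordered = sorted(scores)
    n = len(ordered)
    expected = 0.0
    for i, value in enumerate(ordered, start=1):
        # Probability that the maximum of `budget` draws is the i-th smallest score.
        prob_is_max = (i / n) ** budget - ((i - 1) / n) ** budget
        expected += value * prob_is_max
    return expected


# Hypothetical validation accuracies from five hyperparameter trials.
validation_scores = [0.71, 0.74, 0.69, 0.80, 0.77]
curve = [expected_max_performance(validation_scores, k) for k in range(1, 6)]
for k, value in enumerate(curve, start=1):
    print(f"budget={k}: expected best validation accuracy = {value:.3f}")
```

With a budget of one trial this reduces to the mean score, and the curve rises toward the best observed score as the budget grows.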


Variational Pretraining for Semi-supervised Text Classification
Suchin Gururangan, Tam Dang, Dallas Card, and Noah A. Smith
In Proceedings of ACL, 2019.
Abstract Paper Code BibTeX

We introduce VAMPIRE, a lightweight pretraining framework for effective text classification when data and computing resources are limited. We pretrain a unigram document model as a variational autoencoder on in-domain, unlabeled data and use its internal states as features in a downstream classifier. Empirically, we show the relative strength of VAMPIRE against computationally expensive contextual embeddings and other popular semi-supervised baselines under low resource settings. We also find that fine-tuning to in-domain data is crucial to achieving decent performance from contextual embeddings when working with limited supervision. We accompany this paper with code to pretrain and use VAMPIRE embeddings in downstream tasks.

Hatespeech Figure

The Risk of Racial Bias in Hate Speech Detection
Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith
In Proceedings of ACL, 2019.
Abstract Paper Press BibTeX

We investigate how annotators' insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. We first uncover unexpected correlations between surface markers of African American English (AAE) and ratings of toxicity in several widely-used hate speech datasets. Then, we show that models trained on these corpora acquire and propagate these biases, such that AAE tweets and tweets by self-identified African Americans are up to two times more likely to be labelled as offensive compared to others. Finally, we propose dialect and race priming as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet's dialect they are significantly less likely to label the tweet as offensive.

DWAC Figure

Deep Weighted Averaging Classifiers
Dallas Card, Michael Zhang, and Noah A. Smith
In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (ACM FAT*), 2019.
Abstract Paper Code Blog Post BibTeX

Recent advances in deep learning have achieved impressive gains in classification accuracy on a variety of types of data, including images and text. Despite these gains, however, concerns have been raised about the calibration, robustness, and interpretability of these models. In this paper we propose a simple way to modify any conventional deep architecture to automatically provide more transparent explanations for classification decisions, as well as an intuitive notion of the credibility of each prediction. Specifically, we draw on ideas from nonparametric kernel regression, and propose to predict labels based on a weighted sum of training instances, where the weights are determined by distance in a learned instance-embedding space. Working within the framework of conformal methods, we propose a new measure of nonconformity suggested by our model, and experimentally validate the accompanying theoretical expectations, demonstrating improved transparency, controlled error rates, and robustness to out-of-domain data, without compromising on accuracy or calibration.
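The core prediction rule can be sketched as follows. This is only an illustration of weighting training instances by distance in an embedding space: the Gaussian kernel, the bandwidth, the toy embeddings, and the function name are all assumptions, and the embeddings are taken as given rather than learned.

```python
import math
from collections import defaultdict


def weighted_average_predict(query, train_embeddings, train_labels, bandwidth=1.0):
    """Class probabilities from a kernel-weighted sum over training instances."""
    class_weight = defaultdict(float)
    for embedding, label in zip(train_embeddings, train_labels):
        sq_dist = sum((q - e) ** 2 for q, e in zip(query, embedding))
        # Closer training instances contribute more to their class.
        class_weight[label] += math.exp(-sq_dist / (2 * bandwidth ** 2))
    total = sum(class_weight.values())
    if total == 0:
        raise ValueError("query is too far from every training instance")
    return {label: weight / total for label, weight in class_weight.items()}


# Hypothetical 2-D embeddings of four training instances.
train_embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
train_labels = ["neg", "neg", "pos", "pos"]
probs = weighted_average_predict((0.2, 0.3), train_embeddings, train_labels)
print(probs)
```

Because each prediction is a sum over training instances, the instances with the largest weights can be inspected directly as an explanation for the decision.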

Scholar Figure

Neural Models for Documents with Metadata
Dallas Card, Chenhao Tan, and Noah A. Smith
In Proceedings of ACL, 2018.
Abstract Paper Code Tutorial BibTeX

Most real-world document collections involve various types of metadata, such as author, source, and date, and yet the most commonly-used approaches to modeling text corpora ignore this information. While specialized models have been developed for particular applications, few are widely used in practice, as customization typically requires derivation of a custom inference algorithm. In this paper, we build on recent advances in variational inference methods and propose a general neural framework, based on topic models, to enable flexible incorporation of metadata and allow for rapid exploration of alternative models. Our approach achieves strong performance, with a manageable tradeoff between perplexity, coherence, and sparsity. Finally, we demonstrate the potential of our framework through an exploration of a corpus of articles about US immigration.

Proportions Figure

The Importance of Calibration for Estimating Proportions from Annotations
Dallas Card and Noah A. Smith
In Proceedings of NAACL, 2018.
Abstract Paper Code BibTeX

Estimating label proportions in a target corpus is a type of measurement that is useful for answering certain types of social-scientific questions. While past work has described a number of relevant approaches, nearly all are based on an assumption which we argue is invalid for many problems, particularly when dealing with human annotations. In this paper, we identify and differentiate between two relevant data generating scenarios (intrinsic vs. extrinsic labels), introduce a simple but novel method which emphasizes the importance of calibration, and then analyze and experimentally validate the appropriateness of various methods for each of the two scenarios.

Ideas Figure

Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts
Chenhao Tan, Dallas Card, and Noah A. Smith
In Proceedings of ACL, 2017.
Abstract Paper Blog Post BibTeX

Understanding how ideas relate to each other is a fundamental question in many domains, ranging from intellectual history to public communication. Because ideas are naturally embedded in texts, we propose the first framework to systematically characterize the relations between ideas based on their occurrence in a corpus of documents, independent of how these ideas are represented. Combining two statistics - cooccurrence within documents and prevalence correlation over time - our approach reveals a number of different ways in which ideas can cooperate and compete. For instance, two ideas can closely track each other's prevalence over time, and yet rarely cooccur, almost like a "cold war" scenario. We observe that pairwise cooccurrence and prevalence correlation exhibit different distributions. We further demonstrate that our approach is able to uncover intriguing relations between ideas through in-depth case studies on news articles and research papers.

Personas Figure

Analyzing Framing through the Casts of Characters in the News
Dallas Card, Justin H. Gross, Amber E. Boydstun, and Noah A. Smith
In Proceedings of EMNLP, 2016.
Abstract Paper BibTeX

We present an unsupervised model for the discovery and clustering of latent "personas" (characterizations of entities). Our model simultaneously clusters documents featuring similar collections of personas. We evaluate this model on a collection of news articles about immigration, showing that personas help predict the coarse-grained framing annotations in the Media Frames Corpus. We also introduce automated model selection as a fair and robust form of feature evaluation.

Media Frames Corpus Figure

The Media Frames Corpus: Annotations of Frames Across Issues
Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith
In Proceedings of ACL, 2015.
Abstract Paper Data BibTeX

We describe the first version of the Media Frames Corpus: several thousand news articles on three policy issues, annotated in terms of media framing. We motivate framing as a phenomenon of study for computational linguistics and describe our annotation process.

Media Coverage

About me

I'm originally from Winnipeg, but I have also lived in Toronto, Waterloo, Halifax, Sydney, Kampala, Pittsburgh, Seattle, Palo Alto, and now Ann Arbor!

I am an occasional guest on The Reality Check podcast! You can hear me in episodes #466 (biased algorithms), #382 (deep learning), #362 (Simpson's paradox), and #227 (fMRI and vegetative states).