Avoiding bias when inferring race using name-based approaches

Racial disparity in academia is a widely acknowledged problem. A quantitative understanding of race-based systemic inequalities is an important step towards a more equitable research system. However, few large-scale analyses have been performed on this topic, mostly because of the lack of robust race-disambiguation algorithms. Author-identifying information does not generally include the author’s race. Therefore, an algorithm must be employed that uses known information about authors, i.e., their names, to infer their perceived race. Nevertheless, as with any other algorithm, the process of racial inference can generate biases if it is not carefully considered. When the research is focused on understanding race-based inequalities, such biases undermine the objectives of the investigation and may perpetuate inequities. The goal of this article is to assess the biases introduced by the different approaches used in name-based racial inference. We use information from the US census and mortgage applications to infer the race of US author names in the Web of Science. We estimate the effects of using given and family names, thresholds or continuous distributions, and imputation. Our results demonstrate that the validity of name-based inference varies by race and ethnicity and that threshold approaches underestimate Black authors and overestimate White authors. We conclude with recommendations to avoid potential biases. This article fills an important research gap that will allow more systematic and unbiased studies on racial disparity in science.
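The contrast between threshold and continuous-distribution approaches can be illustrated with a minimal sketch. It is not the implementation used in this article: the names, the probability values in `NAME_PROBS`, and the helper functions `threshold_counts` and `fractional_counts` are hypothetical and chosen only for illustration. Under a threshold, each author is assigned wholly to the most probable group if that probability reaches the cutoff and is otherwise dropped; under the continuous approach, each author contributes fractionally to every group.

```python
from collections import defaultdict

# Hypothetical name-to-race probabilities (illustrative values, not census data).
NAME_PROBS = {
    "name_a": {"White": 0.70, "Black": 0.25, "Other": 0.05},
    "name_b": {"White": 0.65, "Black": 0.30, "Other": 0.05},
    "name_c": {"White": 0.60, "Black": 0.35, "Other": 0.05},
}


def threshold_counts(names, threshold):
    """Assign each name wholly to its most probable group if that
    probability is at least `threshold`; otherwise drop the name."""
    counts = defaultdict(int)
    for name in names:
        group, prob = max(NAME_PROBS[name].items(), key=lambda kv: kv[1])
        if prob >= threshold:
            counts[group] += 1
    return dict(counts)


def fractional_counts(names):
    """Let each name contribute to every group in proportion to its
    probability (the continuous-distribution approach)."""
    counts = defaultdict(float)
    for name in names:
        for group, prob in NAME_PROBS[name].items():
            counts[group] += prob
    return dict(counts)


authors = ["name_a", "name_b", "name_c"]
by_threshold = threshold_counts(authors, threshold=0.6)
by_fraction = fractional_counts(authors)
print(by_threshold)
print(by_fraction)
```

With these assumed values, the threshold approach assigns all three authors to the majority group and records no authors in the other groups, whereas the fractional approach retains the probability mass of every group. This is a toy case of the mechanism, not a reproduction of our results.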

