A machine learning approach to predict ethnicity using personal name and census location in Canada

Full Text: pone.0241239.pdf PDF

Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features.
Methods: Multiclass and binary classification machine learning pipelines were developed using the 1901 census. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all-combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, double-metaphones, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy.
Results: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classification of Chinese, French, Italian, Japanese, Russian, and others, F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 ranged from 63% to 67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models substantially improved Aboriginal classification (F1 increased from 50% to 84%).
Conclusions: An automated machine learning approach using only name and census location features can predict the ethnicity of Canadians, with performance that varies by ethnic category.
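
The abstract reports several binary classification metrics side by side, and the gap between F1 and accuracy is easier to read with the definitions in view. The following is a minimal sketch of the standard formulas computed from confusion-matrix counts. It is not the authors' code: the names `ConfusionCounts` and `binary_metrics` are hypothetical, and the example counts are invented for illustration rather than taken from the paper. Area under the ROC curve is omitted because it needs predicted scores, not just counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts for one binary task (hypothetical container, not from the paper)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def binary_metrics(c: ConfusionCounts) -> dict:
    """Standard definitions of the count-based metrics named in the abstract."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # recall on the target class
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),          # precision
        "npv": _ratio(c.tn, c.tn + c.fn),
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
        "accuracy": _ratio(c.tp + c.tn, total),
    }


if __name__ == "__main__":
    # Invented counts: a minority class of 100 among 1,000 records.
    example = ConfusionCounts(tp=80, fp=20, tn=880, fn=20)
    for name, value in binary_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

With these invented counts, accuracy is 0.96 while F1 is 0.80: when the target category is a small share of the records, true negatives raise accuracy but do not enter the F1 formula.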

Citation

K. Wong, O. Zaiane, F. Davis, Y. Yasui. "A machine learning approach to predict ethnicity using personal name and census location in Canada". PLoS One, 15(11), e0241239, November 2020.

Category: In Journal
Web Links: PLoS ONE

BibTeX

@article{Wong+al:PLoSONE20,
  author = {Kai On Wong and Osmar R. Zaiane and Faith G. Davis and Yutaka
    Yasui},
  title = {A machine learning approach to predict ethnicity using personal name
    and census location in Canada},
  journal = {PLoS One},
  volume = {15},
  number = {11},
  pages = {e0241239},
  year = {2020},
}

Last Updated: January 19, 2021
Submitted by Sabina P