Gender and Geographic Origin Biases


Small Business Classification By Name

Daniel Shapiro, PhD

Jun 14 · 18 min read

Table of Contents

1. Overview

2. Background

3. Dataset and Data Preparation

4.1 Biased Model: A Model with Gender and Name Origin Biases

4.2 Replacing Names in Training Data and at Inference Time

4.3 Augmenting Training Data with Additional Gender Information

5. Analysis

6. Conclusion

7. References

This article took a long time to prepare. Thank you to Professor Miodrag Bolic from the University of Ottawa for reviewing this article and providing valuable feedback.

1. Overview

Small business classification means looking at a business name and putting a label on it. This is an important task in many applications where you want to treat similar clients similarly, and it is something humans do all the time. For example, take the text of a company name as your input (e.g., “Daniel’s Roofing Inc”) and predict the type of the business (e.g., “Roofer”). There are business applications for this technology. For example, take a list of invoices and group the customers into segments for holiday emails. Which customers are similar to each other? You could use invoice amounts as an indicator of business type, but the name of the business is probably very helpful for understanding what the business does with you, and therefore how it relates to what you do and how you want to message it. In this article, small business classification using only the business name is the goal.

In this article, I develop a FastText model for predicting which one of 66 business types a company falls into, based only upon the business name. We will see later on that training a machine learning model on a dataset of small business names and their business types introduces gender and geographic origin biases.

Two approaches to removing the observed bias are explored in this article:

Replacing given names with a placeholder token. The bias in the model was reduced by hiding given names from the model, but the bias reduction caused classification performance to drop.
Augmenting the training data with gender-swapped examples. For example, Bob’s Diner becomes Lisa’s Diner. Augmentation of the training data with gender-swapped samples proved less effective at bias reduction than the name hiding approach on the evaluated dataset.

2. Background

Our initial goal is to observe prediction biases related to the gender and geographic origin of given names. The next goal is to remove those biases.

It is a common issue that a business knows the name of a small business counterparty, but does not know the archetype of the business. Small businesses present a unique challenge when performing customer segmentation, because small businesses may not be listed in databases or classification systems for large companies that include metadata such as the company type or classification code. Beyond the sales and marketing functions, customer segmentation can also be useful for customizing service delivery to clients. The type of a company can also be useful in forecasting sales, and in many other applications. The mix of various types of companies in a client list can also reveal the trends for client segments. Finally, classifying the type of a business may expose useful equity trading signals [1] [2]. The analysis of equity trading models can often include industry classification. For example, the Global Industry Classification Standard (GICS) [3] was applied to assess trading model performance across various industries in [4] and [5].

Existing industry classification systems tend not to cover small businesses well. The criteria for inclusion in widely used industry classification systems are biased toward big businesses, and this bias seeps into training data (i.e., the names of the companies) as a bias for large entities. I imagine the bias is towards companies with high revenue and headcount. For example, a model trained on corporation names from the Russell 3000 index [6] will not be prepared properly at inference time to predict the type of business conducted at “Daniel’s Barber Shoppe”. Neither the business type (e.g., barber), nor the naming conventions (e.g., Shoppe) of small businesses are reflected in the names of larger corporations such as the members of the Russell 3000.

There are several human-curated industry classification systems including Standard Industrial Classification (SIC) [9], North American Industry Classification System (NAICS) [10], Global Industry Classification Standard (GICS) [3], Industry Classification Benchmark (ICB) [11], and others. One comparison of these classification schemes revealed that “the GICS advantage is consistent from year to year and is most pronounced among large firms” [12]. This further reinforces the point that these classification systems are better at categorizing large firms, rather than small businesses.

The labels within industry classification systems provide an opportunity for supervised learning. A complementary approach is to apply unsupervised learning to model the data. For example, clustering for industry classification has been applied in [13].

Measuring bias in a text classifier requires a validation dataset that is not a subset or split of the original dataset. The second dataset is required because without it, the testing and training data will likely contain the same data distribution, hiding bias from the assessment [14]. The purpose of the second dataset is to evaluate the classifier in out-of-distribution conditions it was not specifically trained for.

Gender bias in word embedding models can be reduced by augmenting the training data with gender-swapped samples [15], by fine-tuning with a larger, less biased dataset [16], by other weight adjustment techniques [17], or by removing gender-specific factors from the model [18]. Applying several of these techniques in combination may yield the best results [16]. In this work, augmenting the training data with gender-swapped samples and hiding gender-specific factors from the model were evaluated.

Machine learning models are typically assessed using a top-k (also called top-n) accuracy, to show how precision and recall are affected by the specificity of the response [19]. In top-1 evaluation, an inference is counted as correct only if the highest probability label in the model output matches the true label. Similarly, in top-2 evaluation, an inference is counted as correct if either of the 2 highest probability labels in the model output matches the true label. The looser constraint typically leads to higher recall and poorer precision.
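
To make the top-k definitions concrete, here is a minimal sketch of how precision and recall at k can be computed when each example has exactly one true label. The function name and the toy labels are made up for illustration, and the counting convention (every one of the k returned labels counts as a prediction) is an assumption rather than something this article specifies.

```python
from typing import Sequence, Tuple


def top_k_precision_recall(
    ranked_predictions: Sequence[Sequence[str]],
    true_labels: Sequence[str],
    k: int,
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for single-label examples.

    ranked_predictions[i] lists the predicted labels for example i,
    ordered from most probable to least probable.
    """
    hits = 0        # examples whose true label is among the top k
    predicted = 0   # total number of labels the model emitted
    for ranked, truth in zip(ranked_predictions, true_labels):
        top_k = list(ranked[:k])
        predicted += len(top_k)
        hits += int(truth in top_k)
    precision = hits / predicted if predicted else 0.0
    recall = hits / len(true_labels) if true_labels else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data, not taken from the article's dataset.
    ranked = [
        ["Roofer", "Contractor", "Plumber"],
        ["Restaurant", "Hotel", "Studio"],
        ["Studio", "Laundry", "Hotel"],
    ]
    truth = ["Contractor", "Restaurant", "Hotel"]
    for k in (1, 2):
        print(k, top_k_precision_recall(ranked, truth, k))
```

Under this convention, precision at k can never exceed 1/k when there is one true label per example, which is why loosening k raises recall while lowering precision.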

There is a potential downside to training a model to predict the type of a small business based upon the name of the business alone: introducing bias. Small business names often include gendered and geographically localized given names. This can lead a model to start associating given names with business types, which can lead to regulatory fines against corporations for discrimination. For example, without adjusting for this bias, a trained model may classify Daniel’s Gems as a Jeweller while classifying Sandy’s Gems as a Homecraft store. Further complicating matters, out-of-vocabulary names may also introduce bias by confusing the classifier into picking the wrong label, or by leading to a label associated with a name-based geographic stereotype.

Having motivated the need for an unbiased model that predicts the type of a small business from the name of the business, the approach to developing such a machine learning model is now presented.

The granularity of a model’s predictions affects the observed performance. Specifically, consider that using a random number generator to classify a small business into one of two general categories has a 50% chance of classifying it correctly, whereas the same random number generator has only a 1% chance of correctly classifying a company into one of 100 more detailed categories. Clearly, the statistical penalty for specificity justifies making both a general and a specific prediction from a small business classification model. For this reason, the model presented in this work outputs a high-level prediction regarding the overall business type, along with a more specific prediction regarding the exact small business type.

3. Dataset and Data Preparation

The small business classification labels used in this article were from the city of Vancouver’s Licence Bylaw 4450 [7]. The dataset used for model development was from the city’s open data website [8]. This dataset includes a variety of small businesses more suitable for small business classification than the commonly used industry classification systems.

The labels within the high-level small business type prediction are B2C, B2B, PUB, and B2BC. B2C stands for business to consumer, and so B2C companies sell to the consumer. Similarly, B2B companies sell to other companies on a business to business basis. B2BC companies sell to both businesses and consumers, and PUB is composed of entities that are involved in delivering public services (e.g., schools, associations, and government entities). There is a mapping from the classes within the city of Vancouver’s Licence Bylaw 4450 [7] to the 4 aforementioned high-level categories, as explained in later sections of this work.

A manual data audit of a sample from the dataset revealed that about 30% of the labels are difficult for me (a human) to predict. Table 1 below shows a representative sample of 20 records, revealing some of the limits on model prediction performance. Clearly, small business type classification is challenging when the only information available is the company name. Some company names involve a one-of-a-kind play on words, street addresses, or are not descriptive. Others are either overly generic or overly specific. What is a reasonable performance expectation for this task? There are unsolvable edge cases and the usefulness of the classifier (is it good enough to apply as a customer segmentation signal) is subjective. The results are also very sensitive to the dataset used to train the machine learning model. In this article, I will report the results I observed, and leave it up to you to decide the qualitative acceptance criteria if you try to incorporate this stuff into your own work.

Table 1: Processed random sample from the company names dataset, indicating that some labels are difficult to correctly predict better than chance, even for humans.

The dataset preparation began with clarifying the business name and type for each business in the dataset. To select the most descriptive name from the BusinessName and BusinessTradeName fields in the dataset, the BusinessTradeName was used as the business name unless that field was blank, in which case the BusinessName field was used as the business name. Next, business names beginning and ending with parentheses (indicating the name of the business is a person’s name) were removed from the dataset. The business type was extracted from the BusinessType field of the dataset. Next, duplicates from the set of business name and type pairs were dropped. These duplicates may exist because of events such as license renewals. Next, items with low-frequency business type labels were dropped from the dataset. Specifically, labels with fewer than 100 samples were dropped.
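
The cleaning steps above can be sketched roughly as follows. This is not the code used for the article, just a minimal standard-library version under a few assumptions: rows are plain dicts keyed by the BusinessName, BusinessTradeName, and BusinessType field names mentioned above, and rows with no usable name or type are skipped (the article does not say how such rows were handled). The function name prepare_records is made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def prepare_records(
    rows: Iterable[Dict[str, Optional[str]]],
    min_samples: int = 100,
) -> List[Tuple[str, str]]:
    """Return cleaned (business name, business type) pairs."""
    pairs: List[Tuple[str, str]] = []
    seen = set()
    for row in rows:
        trade_name = (row.get("BusinessTradeName") or "").strip()
        legal_name = (row.get("BusinessName") or "").strip()
        # Prefer the trade name; fall back to the legal name when it is blank.
        name = trade_name or legal_name
        business_type = (row.get("BusinessType") or "").strip()
        if not name or not business_type:
            continue  # assumption: unusable rows are skipped
        # A name wrapped in parentheses indicates a person's name; drop it.
        if name.startswith("(") and name.endswith(")"):
            continue
        pair = (name, business_type)
        if pair in seen:
            continue  # duplicate, e.g. from a licence renewal
        seen.add(pair)
        pairs.append(pair)
    # Drop business types with fewer than min_samples examples.
    counts = Counter(business_type for _, business_type in pairs)
    return [pair for pair in pairs if counts[pair[1]] >= min_samples]
```

The frequency filter runs after de-duplication so that renewals of the same licence do not inflate the count for a rare label.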

The general category “Office” in the dataset was dropped from the dataset as it was not descriptive for the purposes of building a business name classification model and seems to be a catchall. It may not be good practice to do this as there may be sneaky examples in that category, but on we go.

Similar categories were merged into broader class labels when the business names within the original categories were deemed too similar to separate. Specifically, several labels from the real-estate sector (Apartment House, Pre-1956 Dwelling, Non-profit Housing, Apartment House Strata, Secondary Suite - Permanent, Multiple Dwelling, Duplex, and One-Family Dwelling) were replaced with the general label Residential/Commercial. The labels Temp Liquor Licence Amendment, Liquor Establishment Standard, Liquor Establishment Extended, Liquor License Application, and Liquor Retail Store were replaced with the more general label Liquor Establishment. Items in the category U-Brew/U-Vin were merged with the Liquor Equipment class label. The three labels Laundry-Coin Operated Services, Laundry Depot, and Laundry (w/equipment) were replaced with the more general label Laundry. The labels Ltd Service Food Establishment, Restaurant Class 1, and Restaurant Class 2 were replaced with the more general label Restaurant. The labels Short-Term Rental and Motel were combined with the category Hotel. Contractor - Special Trades was combined with the category Contractor. Although not all business and trade schools are private schools, the label School (Business & Trade) was merged into the category School (Private), as the names were similar in both categories. Finally, the label Artist Live/Work Studio was merged into the category Studio.
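
A label merge like this is easy to express as a lookup table. The sketch below covers only some of the merges listed above, and the label strings are copied from this article, so the exact spelling in the raw dataset may differ. The names LABEL_MERGES and merge_label are made up for illustration.

```python
from typing import Dict, List

# Broader label -> original labels folded into it (partial list).
LABEL_MERGES: Dict[str, List[str]] = {
    "Liquor Establishment": [
        "Temp Liquor Licence Amendment",
        "Liquor Establishment Standard",
        "Liquor Establishment Extended",
        "Liquor License Application",
        "Liquor Retail Store",
    ],
    "Laundry": [
        "Laundry-Coin Operated Services",
        "Laundry Depot",
        "Laundry (w/equipment)",
    ],
    "Restaurant": [
        "Ltd Service Food Establishment",
        "Restaurant Class 1",
        "Restaurant Class 2",
    ],
    "Hotel": ["Short-Term Rental", "Motel"],
    "School (Private)": ["School (Business & Trade)"],
    "Studio": ["Artist Live/Work Studio"],
}

# Invert the table once so each lookup is a single dict access.
CANONICAL_LABEL: Dict[str, str] = {
    original: merged
    for merged, originals in LABEL_MERGES.items()
    for original in originals
}


def merge_label(label: str) -> str:
    """Map an original label to its merged label; other labels pass through."""
    return CANONICAL_LABEL.get(label, label)
```

Keeping the table keyed by the merged label makes it easy to audit which original categories ended up in each class.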

Having completed the pre-processing, the 66 business types tracked in the dataset are presented in Table 2 below, along with the number of samples per label.

Table 2: Business types and their sample counts after data pre-processing

I didn’t clean up the code for this project, but I assume it will help you to have a look at some of the code I wrote to make this article. You can click here to see a gist of a lot of the stuff I used to wrangle the dataset, train models, and so forth.

4.1 Biased Model: A Model with Gender and Name Origin Biases

After the pre-processing described above, a FastText supervised learning model was trained on the dataset with a random weight initialization [20]. A hyperparameter search of learning rates and the number of training iterations resulted in the selection of 6 training iterations as the early stopping point, and a learning rate of 0.2 being selected. The embedding dimension width of the model was 100 dimensions, and the window size was set to 5. Model performance is recorded in Table 4 below, along with some spoilers for the next 2 sections.
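
For readers who want to reproduce a setup like this, here is a rough sketch using the fasttext Python package with the hyperparameters quoted above (6 epochs, learning rate 0.2, 100 dimensions, window size 5). It is not the training code from this project. The helper names and the choice to replace spaces in labels with underscores are assumptions, the latter because fastText reads whitespace-separated tokens and expects labels to carry the `__label__` prefix.

```python
from typing import Iterable, Tuple


def to_fasttext_line(name: str, business_type: str) -> str:
    """Format one training example in fastText's supervised text format."""
    label = "_".join(business_type.split())  # labels must not contain spaces
    text = " ".join(name.split())            # collapse whitespace and newlines
    return f"__label__{label} {text}"


def write_training_file(pairs: Iterable[Tuple[str, str]], path: str) -> None:
    """Write (business name, business type) pairs, one example per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, business_type in pairs:
            handle.write(to_fasttext_line(name, business_type) + "\n")


def train_model(train_path: str):
    """Train a supervised fastText classifier on the prepared file."""
    # Imported here so the formatting helpers work without fasttext installed.
    import fasttext

    return fasttext.train_supervised(
        input=train_path,
        lr=0.2,    # learning rate
        epoch=6,   # training iterations
        dim=100,   # embedding width
        ws=5,      # window size
    )
```

A trained model can then be scored on a held-out file with `model.test(path, k=2)`, which returns the number of examples together with precision and recall at k.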

This initial model was trained on business names from one geographic part of the world (Western Canada). The dataset included a variety of local given names and family names in the business name training data. The dataset is from an English language speaking province. These factors represent a strong potential for classification bias based upon the gender of a given name within a business name, and the geographic origin of names within business names.

We want to assess when the model changes its mind based only on something biased, like the gender associated with a name. Looking at Table 6, two approaches were applied to assess model bias. In the first approach, the model was assessed by constructing out-of-distribution data composed of a randomly selected given name followed by a random dictionary word with the first letter capitalized. For example, “Olivia’s Mirror” vs “Noah’s Mirror”. In the second approach, text from the model’s testing dataset (data held back from the model training) was appended to a randomly selected person’s name. For example, “Daniel’s Bob’s Grocery Store”. In order to examine bias within the model, the generation process controlled for the gender and geographic origin of the person’s name for both of these approaches. The lists of names by gender and geographic origin that were used for testing are shown in Table 5 below.

We noted earlier that this is a challenging dataset where a human (me) has trouble with about 30% of the sampled data. We can see in Table 4 that the top-2 predictions for this initial model are okay, with about a 73% chance that a given business type was labeled correctly out of 66 categories (recall), and a 36% chance that a label applied was the right one (precision). But we also see in Table 6 in the column for “Initial Model” that the model is pretty biased, with anywhere from 1% to 10%, or as high as 19% bias, depending on how you measure. In the next couple of sections, we try to remove those biases from the model.
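
As an illustration of the probing idea, the sketch below builds pairs of business names that differ only in the given name and counts how often a classifier changes its answer. The disagreement rate used here is a simplified stand-in, not necessarily the imbalance measure reported in Table 6, and the function names are hypothetical. The predict argument is any function that maps a business name to a label, for example a thin wrapper around a trained model.

```python
import random
from typing import Callable, List, Sequence, Tuple


def probe_pairs(
    names_a: Sequence[str],
    names_b: Sequence[str],
    suffixes: Sequence[str],
    n: int,
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Build n pairs of business names that differ only in the given name.

    suffixes can hold dictionary words (first approach) or held-out
    business names from the test set (second approach).
    """
    pairs = []
    for _ in range(n):
        suffix = rng.choice(suffixes)
        # Capitalize only the first letter and leave the rest untouched.
        suffix = suffix[:1].upper() + suffix[1:]
        name_a = rng.choice(names_a)
        name_b = rng.choice(names_b)
        pairs.append((f"{name_a}'s {suffix}", f"{name_b}'s {suffix}"))
    return pairs


def disagreement_rate(
    pairs: Sequence[Tuple[str, str]],
    predict: Callable[[str], str],
) -> float:
    """Fraction of pairs where the predicted label changes with the name."""
    if not pairs:
        return 0.0
    differing = sum(1 for a, b in pairs if predict(a) != predict(b))
    return differing / len(pairs)
```

For example, `probe_pairs(["Olivia"], ["Noah"], ["mirror"], 1, random.Random(0))` yields the pair “Olivia's Mirror” and “Noah's Mirror”. Passing lists of names grouped by gender or by geographic origin gives a separate rate for each comparison.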

4.2 Replacing Names in Training Data and at Inference Time

To address the identified biases in the initial model, the training data was further processed to replace given names within the company names dataset with a fixed string that we call a token. The key insight with this approach is that instead of trying to learn about out-of-distribution “foreign” names or gendered names, the model can learn less about local given names and their related gender. The given name replacement task was first attempted using spaCy’s named entity recognition capability [23], but ultimately a dictionary-based approach proved more effective for the dataset in question. To obtain the results reported in this work, a long list of male and female names was obtained from a Python library [24], which in turn obtained the name list from the 1990 U.S. Census [25]. One issue with removing all names from the business name is that a business name may be deleted if the full name of the business is a name, causing classification to fail. For example, the name “Denny’s” simply disappears into a blank string because “Denny” is also a given name. Another case where this issue arises is businesses registered as a person’s name, e.g., “John Smith”. A standardized string was used to represent the replaced name. Specifically, the character “_” was used as the replacement token.
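
A dictionary-based replacement along these lines might look like the sketch below. It is an assumption-laden illustration rather than the code behind the reported results: it matches whole alphabetic words case-insensitively, and given_names stands in for whatever name list you load (for example, one derived from the 1990 U.S. Census files).

```python
import re
from typing import Iterable

REPLACEMENT_TOKEN = "_"
_WORD = re.compile(r"[A-Za-z]+")


def replace_given_names(
    business_name: str,
    given_names: Iterable[str],
    token: str = REPLACEMENT_TOKEN,
) -> str:
    """Replace every whole word found in given_names with the token."""
    lowered = {name.lower() for name in given_names}

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        return token if word.lower() in lowered else word

    return _WORD.sub(swap, business_name)


if __name__ == "__main__":
    names = ["Daniel", "Bob", "Lisa"]
    print(replace_given_names("Daniel's Roofing Inc", names))
    print(replace_given_names("Bob and Lisa Plumbing Ltd", names))
```

Matching whole words avoids replacing a name that merely appears inside a longer word. Case-insensitive matching is a trade-off: it catches names written in capitals, but it can also hide ordinary words that happen to be on the name list. The same function has to be applied to inputs at inference time so that the model sees the token it was trained on.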

After this additional pre-processing, a new FastText supervised learning model was trained on the dataset. A hyperparameter search of learning rates and number of training iterations resulted in the selection of 6 training iterations as the early stopping point, and a learning rate of 0.2 being selected. The embedding dimension width of the model was 100 dimensions, and the window size was set to 5. The performance of this model is recorded in Table 4 above.

The observed drop of approximately 4% in top-1 precision and recall relative to the original model is an understandable outcome of removing bias within the dataset which was providing signal for classification. This drop indicates that perhaps some classification performance was originating in name-based biases.

4.3 Augmenting Training Data with Additional Gender Information

A second method for addressing the identified biases in the initial model is to augment the training data with gender-swapped copies of the text. For example, the business name “Alice and Associates Plumbing Ltd” with the label Plumber can be observed as containing a given name within the aforementioned list of female given names, and a new training record with the label Plumber can be added to the dataset: “Bob and Associates Plumbing Ltd”. Similarly, training records containing male given names can be augmented with a gender-swapped version. The intuition is that a balance of gender information per label could cancel out gender bias for each label.
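
The augmentation step could be sketched as follows. The function names are hypothetical, the replacement name is drawn at random from the opposite list (the article does not say how the swapped name was chosen), and names are matched as whole words regardless of case.

```python
import random
import re
from typing import List, Sequence, Tuple

_WORD = re.compile(r"[A-Za-z]+")


def swap_gendered_names(
    text: str,
    female_names: Sequence[str],
    male_names: Sequence[str],
    rng: random.Random,
) -> str:
    """Replace each female given name with a male one and vice versa."""
    female = {name.lower() for name in female_names}
    male = {name.lower() for name in male_names}

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        if word.lower() in female:
            return rng.choice(male_names)
        if word.lower() in male:
            return rng.choice(female_names)
        return word

    return _WORD.sub(swap, text)


def augment_with_swaps(
    records: Sequence[Tuple[str, str]],
    female_names: Sequence[str],
    male_names: Sequence[str],
    rng: random.Random,
) -> List[Tuple[str, str]]:
    """Return the records plus a gender-swapped copy of each affected one."""
    augmented = list(records)
    for name, label in records:
        swapped = swap_gendered_names(name, female_names, male_names, rng)
        if swapped != name:
            augmented.append((swapped, label))  # the label is unchanged
    return augmented
```

Records without a listed given name are left alone, so the augmented set grows only by the number of records that actually contain a name.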

After the training data augmentation, a new FastText supervised learning model was trained on the dataset. A hyperparameter search of learning rates and the number of training iterations resulted in the selection of 6 training iterations as the early stopping point, and a learning rate of 0.2 being selected. The embedding dimension width of the model was 100 dimensions, and the window size was set to 5. The model performance is recorded in Table 4 above.

5. Analysis

Observe in Table 6 that as expected, the model evaluation on the out-of-distribution data (Approach 1) revealed a much higher bias than observed when testing on the within-distribution dataset (Approach 2). The approach to replace names with a fixed string during training and inference resulted in the elimination of bias in the out-of-distribution test. Specifically, in Approach 1 the classification imbalance per class dropped from 11.35% (σ = 2.23%) to 0.01% (σ = 0.00%). Regarding the results on the test data held back from the model training, observe that the bias dropped from 1.54% (σ = 0.08%) down to about a third, at 0.47% (σ = 0.06%). And so we can see that there was a significant drop in bias for both evaluation approaches we tried out in this article.

Surprisingly, although augmentation of the training data with gender-swapped samples worked in [15], it was not as effective on the dataset studied in this work. The out-of-distribution bias result (Approach 1) was 3.32% (σ = 0.01%), which is significantly lower than for the original model. However, the result was not as strong as with the name replacement approach, and it came at the expense of lower model performance and a worse imbalance on the test data from the company names dataset, at 1.90% (σ = 0.19%).

Model stability was an issue with both the bias reduction approaches. The models were sensitive to very small changes. For example, models were sensitive to the capitalization of words in the business name such that “Bob’s Plumbing” was classified as Plumber, while “Bob’s plumbing” was classified as Restaurant. This sensitivity was a major factor in the classification disagreements between the samples fed to the models. For example, in the name replacement model, the input samples “Aria Lodge & Associates _ Ltd” and “_ Lodge & Associates _ Ltd” resulted in different class predictions, even though the text only differs by a few characters.

Note that although the results may seem to imply that the name replacement approach catches every name in the training and testing data, this is not the case. The predictions do converge to very low classification disagreement based upon given names, but the dataset itself still includes several names that were not picked up by the 1990 U.S. Census names list. For example, the name Aria from the Canadian Female names list and the name Ximena from the Mexican Female names list were both not removed by the name replacement approach. Furthermore, many names within the dataset, such as Ho and Fraser, were not removed. This observation implies that removing most given names from the training data was sufficient to address most of the problems. Also, there were scenarios where a token within the text was incorrectly replaced, deleting some information that could have been used for classification. For example, the business name “Lodge & Associates Investigations Ltd” lost the word “Investigations” because it has a substring “In” that matches a name in the names list. Removing this one case (“In”) from the names list did not improve model performance significantly, and so there is likely a collection of such name replacement precision and recall improvements that collectively would improve the model’s overall precision and recall. This additional direction for improvement is left as future work. Don’t you love it when a book says “this is left as an exercise for the reader”? This really means I have the idea but didn’t take the time to program the idea and test it.

6. Conclusion

A small business type classifier was developed in this article, and gender and geographic origin biases were observed in the model. Although the bias in the model was reduced by hiding given names from the model, this was accomplished at the expense of model performance. Augmentation of the training data with gender-swapped samples was less effective than the name hiding approach. In a full-on project, the precision and recall of the name hiding approach could probably be improved a lot. In the next article I write on this topic, I plan to look at a scikit-learn pipeline for text classification, and removing bias using feature selection.

If you liked this article, then have a look at some of my most read past articles, like “How to Price an AI Project” and “How to Hire an AI Consultant.” And hey, join the newsletter!

Until next time!

-Daniel

Lemay.ai

daniel@lemay.ai

7. References

[1] Lamponi, D.: Is Industry Classification Useful to Predict US Stock Price Co-Movements? The Journal of Wealth Management 17(1) (2014) 71–77

[2] Kakushadze, Z., Yu, W.: Open source fundamental industry classification. Data 2(2) (2017) 20

[3] Barra, M.: Global Industry Classification Standard (GICS). Technical report, Standard & Poor’s (2009)

[4] Fischer, T., Krauss, C.: Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270(2) (2018) 654–669

[5] Brofos, J.: Ensemble committees for stock return classification and prediction (2014)

[6] Russell Indexes: Russell 3000® Index

[7] City of Vancouver: City of Vancouver British Columbia license by-law no. 4450 (2019) https://bylaws.vancouver.ca/4450c.PDF .

[8] City of Vancouver: Business licences — city of Vancouver open data portal (2019) https://opendata.vancouver.ca/explore/dataset/business-licences/information/?disjunctive.statusdisjunctive.businesssubtype .

[9] UK Crown: Standard industrial classification of economic activities (SIC) (2018) Accessed on 12.25.2019 https://www.gov.uk/government/publications/standard-industrial-classification-of-economic-activities-sic .

[10] United States: North American Industry Classification System. Standard, Executive Office of The President — Office of Management and Budget (2017)

[11] FTSE Russell: Industry Classification Benchmark (ICB) Accessed on 12.26.2019 https://www.ftserussell.com/data/industry-classification-benchmark-icb .

[12] Bhojraj, S., Lee, C.M., Oler, D.K.: What’s my line? A comparison of industry classification schemes for capital market research. Journal of Accounting Research 41(5) (2003) 745–774

[13] Kakushadze, Z., Yu, W.: Statistical industry classification. Journal of Risk & Control 3(1) (2016) 17–65

[14] Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L.: Measuring and mitigating unintended bias in text classification. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, ACM (2018) 67–73

[15] Zhao, J., Wang, T., Yatskar, M., Ordonez, V., Chang, K.W.: Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876 (2018)

[16] Park, J.H., Shin, J., Fung, P.: Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231 (2018)

[17] Jiang, H., Nachum, O.: Identifying and correcting label bias in machine learning. arXiv preprint arXiv:1901.04966 (2019)

[18] Bolukbasi, T., Chang, K.W., Zou, J.Y., Saligrama, V., Kalai, A.T.: Man is to computer programmer as woman is to homemaker? Debiasing word embedding. In: Advances in neural information processing systems. (2016) 4349–4357

[19] Kamath, U.L., Whitaker, J.: Deep Learning for NLP and Speech Recognition. Springer (2019)

[20] Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759 (2016)

[21] BabyCenter, L.L.C.: Canada’s most popular names of 2017: Top 20 and trends (2018) Accessed on 12.24.2019 https://www.babycenter.ca/a25024668/canadas-most-popular-names-of-2017-top-20-and-trends .

[22] BabyCenter, L.L.C.: Los nombre más comunes para bebés en el 2015 (2015) Accessed on 12.24.2019 via web archive https://web.archive.org/web/20160611202700/http://vidayestilo.terra.com.mx/mujer/los-nombre-mas-comunes-para-bebes-en-el-2015,878794be54caa410VgnVCM10000098cceb0aRCRD.html .

[23] Honnibal, M., Montani, I.: spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear (2017)

[24] Hunner, T., Visser, S.: Random name generator Accessed on 12.25.2019 https://github.com/treyhunner/names .

[25] census.gov: Frequently occurring surnames from census 1990 — names files (September 2014) Accessed on 12.26.2019 https://www.census.gov/topics/population/genealogy/data/1990census/1990censusnamefiles.html .
