- NVIDIA StyleGAN
- METFaces Dataset
- FairFace Algorithm
- Low Light Detection
- FET Python
- 1,200,000 Generations
- 32 Filial Generations
- 7 Racial Groups
- African (Black)
- South East Asian
- South Asian
- Latino Hispanic
- East Asian
- Middle Eastern
I’m Not Reading All That!
In early 2020 we set out to build a database of realistic human faces for ourselves, and we quickly found it very challenging to find usable and, more importantly, diverse faces for our machine learning algorithms.
Then an intern had a bright idea: “why don’t we create our own faces?” The Human GAN project was born. The TL;DR of it: a sorted, NON-COMMERCIAL dataset of generated human faces is available below for you to use in augmenting your own datasets for facial analysis, machine learning training and data validation. The generated images are only possible courtesy of NVIDIA’s StyleGAN and the METFaces Dataset.
“In both academia and industry, computer scientists tend to receive kudos (from publications to media coverage) for training ever more sophisticated algorithms. Relatively little attention is paid to how data are collected, processed and organized.”
James Zou & Londa Schiebinger, AI can be sexist and racist — it’s time to make it fair (2018)
1. Scope Of The Project
Racial Bias In AI Is Very Much Scientific Fact
Critical research by Buolamwini and Gebru (2018) highlights an overrepresentation of lighter-skinned men and women in commonly used machine learning datasets such as IJB-A and Adience (79.6% of subjects in IJB-A and 86.2% in Adience are lighter skinned). This underrepresentation of darker-skinned faces leads to more false negatives and more frustrating user experiences for darker-skinned individuals, such as the iPhone’s facial recognition unlock struggling in low light conditions.
Klare et al. (2012) have shown that algorithms used by US law enforcement consistently underperform for people classified as ‘Black’ or ‘Female’, or aged between 18 and 30. Similarly, Ngan and Grother (2015) have shown that female-labelled faces receive more incorrect predictions than male-labelled faces, a clear discrepancy in how machine learning algorithms perform across gender and race.
While gender and skin colour play a significant role in A.I. prediction accuracy, our background in facial aesthetics at QOVES leads us to believe that racial phenotype is the more important factor to focus on, a view supported by Buolamwini and Gebru (2018). Phenotypic labelling, i.e. labelling by a person’s physical traits such as skin colour, facial features and body dimensions, is in our opinion better than ethnic or racial classification. As Buolamwini’s paper states, “Black in the US can represent many hues.”
It is better to initially balance a dataset by physical characteristics, rather than arbitrary racial labels.
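As a rough illustration of what balancing by physical characteristics can look like in practice, here is a minimal sketch that turns group counts into sampling weights. The labels (`lighter`, `darker`) and the `balancing_weights` helper are hypothetical placeholders for illustration, not the labelling scheme or code used in this project.

```python
from collections import Counter

# Hypothetical phenotype labels for a small, deliberately unbalanced sample.
# The label names are placeholders, not the project's actual taxonomy.
labels = ["lighter"] * 8 + ["darker"] * 2


def balancing_weights(labels):
    """Return one sampling weight per group so each group carries equal total weight."""
    counts = Counter(labels)
    n_groups = len(counts)
    total = len(labels)
    # Inverse-frequency weighting: weight = total / (n_groups * group_count).
    return {group: total / (n_groups * count) for group, count in counts.items()}


weights = balancing_weights(labels)
for group, weight in weights.items():
    print(f"{group}: weight per image = {weight}")
```

With these weights the 8 `lighter` images and the 2 `darker` images carry the same total weight, so a weighted sampler would draw from both groups equally often.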
Buolamwini, J. and Gebru, T. (2018). Gender shades: intersectional accuracy disparities in commercial gender classification. In: Friedler, S.A. and Wilson, C. (editors), Proceedings of the 1st Conference on Fairness, Accountability and Transparency, in Proceedings of Machine Learning Research, New York University, NYC, pp. 1–15
Brendan F Klare, Mark J Burge, Joshua C Klontz, Richard W Vorder Bruegge, and Anil K Jain. Face recognition performance: Role of demographic information. IEEE Transactions on Information Forensics and Security, 7(6):1789–1801, 2012.
Mei Ngan and Patrick Grother. Face recognition vendor test (FRVT) performance of automated gender classification algorithms. US Department of Commerce, National Institute of Standards and Technology, 2015.
1.1 Why Does This Matter?
A Vicious Cycle of Human Bias
Consider this: two people, equal in every way, except that one is white and one is black. The black individual has a misdemeanour on her record from when she was 15, for mistaking another girl’s bike for her own. As soon as she realised, she dropped the bike and gave it back, but it was too late: a nosey neighbour had already called the police. She was arrested and had a misdemeanour recorded against her name for an honest mistake, the product of over-policing and a gross overreaction from her prejudiced neighbour.
Now what about the white individual? She also has misdemeanours against her name, from petty theft. She would often give herself a ‘five-finger discount’ on candy bars and soda. One day she wanted to up the ante and tried to steal $80 of electronics from the local Walmart. Problem is, the Loss-Reduction Officer had his eye on her from the beginning, and she was immediately caught and charged. Both she and her black counterpart had misdemeanours for the same petty theft; the difference is that one was an honest mistake and the other was deliberate. When law enforcement agencies use A.I. years later to assess them for risk of recidivism (reoffending), the black individual will rank significantly higher at risk of reoffending, despite their misdemeanour charges being exactly the same.
It is plainly apparent that there are significant biases in our thinking, from looks-based discrimination, sexism, heightism and ageism to outright racism. These biases are carried into machine learning algorithms, in the same way a parrot can mimic human voices and words without understanding what it is saying. Over-policing of minorities leads to disproportionately higher arrest rates, which in turn leads A.I. algorithms to flag anyone with similar facial features as high risk, and the cycle continues.
2. Understanding The Problem
Data, Data Labels, Models, Model Outputs, and Social Analyses
In data science, the first step in producing a machine learning algorithm is to feed it training data. This is where the first prong of this multi-faceted issue lies. The industry may seem diverse, but if you are up to date on the latest A.I. literature, you may quickly realise that much of the research is based on the same handful of datasets. Datasets like ImageNet, which contains 14 million images, or the Yahoo Flickr Dataset (YFCC100M), with 100 million images, are the most commonly used because of their tremendous size. While 100 million images offer a lot of diversity, that diversity means very little over time when everyone uses the same few datasets, especially as only a small subset of the data is actually usable. Whatever imbalances, biases or weights are present in these datasets get carried over to new projects.
Creating new datasets like the Pilot Parliamentary Benchmark or the FairFace Dataset is crucial to pushing researchers to increase diversity in their training data, because it makes diverse data easily accessible.
The way these datasets are labelled is problematic too. This is more a matter of human error, where Western-centric culture is overrepresented in the data. As such, computers pick up on a traditional bridal dress (the white gown and all) as ‘bride’ and ‘wedding’ but detect an Indian wedding gown or sari as ‘Performative Art’ (1). The computer is not actively trying to discriminate; rather, the data is poorly labelled. Both are wedding dresses and should be labelled with individual attributes, i.e. Indian Bridal Gown, American Bridal Gown… and so forth, to add clarity. Using basic, Western-centric nomenclature in labelling adds to the divide between the West and everyone else, who make up the majority of the world’s population.
Shankar, S. et al. Preprint at https://arxiv.org/abs/1711.08536 (2017)
As machine learning algorithms are costly to produce, scientists use different methods to augment their datasets. The most costly part of this process is not actually developing the algorithm; that’s quite straightforward, and past boilerplate algorithms can be modified to suit a use case. Rather, the cost is in gathering the data. Detecting skin cancer is a common and incredibly beneficial medical use case of A.I., but the training is often flawed. Many M.L. algorithms are trained on unbalanced datasets with an overrepresentation of fair skin or Caucasian features (due to the availability of such images online), with one algorithm using as little as 5% dark-skinned images for training (1). These algorithms are poorly trained but are still considered ‘safe for use’ in diagnosing patients of colour, for whom they were not trained.
The consequences of a false negative (i.e. the patient does have skin cancer, but the A.I. did not detect it) can be fatal. While developers can skew the system to be more conservative in its guesses, this problem should be avoided right from the beginning by using better-balanced datasets covering more racial groups.
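To make “more conservative” concrete, the sketch below lowers a classifier’s decision threshold and counts the mistakes on each side. The scores and labels are made up for illustration and the helper names are our own; no real model or dataset is involved.

```python
def count_false_negatives(scores, labels, threshold):
    """Count true positives (label 1) that the model misses at this threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)


def count_false_positives(scores, labels, threshold):
    """Count healthy cases (label 0) that the model flags at this threshold."""
    return sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)


# Made-up predicted probabilities of malignancy and ground-truth labels.
scores = [0.92, 0.45, 0.35, 0.60, 0.20, 0.40, 0.10]
labels = [1, 1, 1, 0, 0, 0, 0]

for threshold in (0.5, 0.3):
    fn = count_false_negatives(scores, labels, threshold)
    fp = count_false_positives(scores, labels, threshold)
    print(f"threshold={threshold}: false negatives={fn}, false positives={fp}")
```

On this toy data, dropping the threshold from 0.5 to 0.3 removes both missed cancers but doubles the false alarms. The threshold only trades one error for another, which is why a better-balanced training set is the preferable fix.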
Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017 Feb 2;542(7639):115-118. doi: 10.1038/nature21056. Epub 2017 Jan 25. Erratum in: Nature. 2017 Jun 28;546(7660):686. PMID: 28117445; PMCID: PMC8382232.
3. Our Proposed Solution
Using NVIDIA’s StyleGAN FFHQ model, we generated a large pool of faces. In our case, we went with an initial batch of 100,000 images at a project time of 120 GPU-hours. This was cut down from 1,000,000, as running inference to generate faces is relatively easy, but filtering and augmenting them for retraining is very time consuming and scales exponentially in cost.
Very quickly, it became apparent that there was a significant bias towards Caucasian faces, as faces of colour rarely appeared in the initial pool. This is unsurprising, as the FFHQ dataset that NVIDIA used to train their GAN is also heavily unbalanced across racial groups.
Initial Data Distribution From FFHQ StyleGAN
Clearly the data is very unbalanced, which is where the QOVES Labs team comes in to synthesize new faces from the existing generated ones. For instance, we can combine two existing African faces to create a unique new one, giving us 3 faces to work with. This is called style mixing and is one of the reasons why we have taken this GAN approach with NVIDIA’s StyleGAN.
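Here is a minimal sketch of the idea behind combining two faces, assuming each face is represented by one style vector per generator layer. The layer count, vector width and the `style_mix` and `random_latent` helpers are assumptions for illustration; random numbers stand in for real latents and no generator is called.

```python
import random

NUM_LAYERS = 18  # assumed number of style inputs for a 1024x1024 generator
W_DIM = 512      # assumed width of each per-layer style vector


def random_latent(seed):
    """Stand-in for a real per-layer latent: NUM_LAYERS rows of W_DIM floats."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(W_DIM)] for _ in range(NUM_LAYERS)]


def style_mix(w_a, w_b, crossover):
    """Take layers [0, crossover) from parent A and the remaining layers from parent B."""
    if len(w_a) != len(w_b):
        raise ValueError("both parents must have the same number of layers")
    return [list(row) for row in w_a[:crossover]] + [list(row) for row in w_b[crossover:]]


parent_a = random_latent(seed=1)
parent_b = random_latent(seed=2)

# Early layers come from parent A, later layers from parent B.
child = style_mix(parent_a, parent_b, crossover=8)
print(len(child), len(child[0]))
```

In a style-based generator the early layers tend to control coarse attributes such as pose and face shape, while later layers control finer detail, so moving the crossover point changes how much the child takes from each parent.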
3.1 GANS…? What On Earth Are GANS?
Rise Of The Generative Adversarial Networks
Ian Goodfellow et al. (2014) proposed a machine learning architecture in which two separate machine learning models are pitted against each other. One, known as the generator, is hell-bent on creating the most realistic-looking xyz possible… in our case, human faces. The other, the discriminator, does its best, like a fine-toothed detective, to tell whether an image was made by a computer or is an actual photograph. For every image the discriminator successfully calls out as a 2-bit hack job, the generator goes back to the drawing board and modifies its neural network parameters to attempt to fool the discriminator again. This process of trial and error is computationally heavy and time-consuming, but it has produced stunning results in the past few years. This process of generative adversaries fighting it out with each other for eternity is where the name comes from, shortened to G.A.N.
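For the mathematically inclined, this tug of war is usually written as the minimax objective from Goodfellow et al. (2014). The symbols are the paper’s notation rather than anything defined in this article: G is the generator, D the discriminator, x a real image, z a random noise vector, and D(·) the discriminator’s estimated probability that its input is real.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to push V up by scoring real images near 1 and generated ones near 0, while the generator tries to push it down by making D(G(z)) as close to 1 as it can.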
Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014)
8 Versions Of The Same Seed (Stochastic Variation)
3.2 NVIDIA’s StyleGAN
Because It’s Very… Stylish
While there are more GAN models out there on GitHub than animals at my local zoo (no really, check out the GANZoo Repository), none suit our purposes of creating real, fake humans as well as NVIDIA’s StyleGAN. Among the many iterations proposed by Karras and colleagues since 2018, StyleGAN2-ADA is the latest and most useful for our purposes.
Their GAN is a ‘style-based’ generator, which in simple terms focuses on high-level attributes of an image, for instance the pose or phenotypical features of a face, or even emotions at different ages. Then it adds stochastic variation, which is a fancy way of saying random changes to the image, to generate truly unique faces. This can take the form of hairstyles or blemishes such as freckles or moles. A million unique faces can be generated from the exact same seed number, i.e. slightly different versions of the same person. StyleGAN provides the most control we have seen of any GAN due to its style-based approach to facial features and attributes, where we can trial-and-error parameter values in the model to generate the exact type of face we want. Plus, StyleGAN comes pretrained on the Flickr-Faces-HQ dataset to produce 1024×1024 faces right off the bat, which is incredibly useful for transfer learning new models, as training from scratch can take up to 41 days, even with a beast of a GPU such as the Tesla V100.
As of November 2021, this project will be using StyleGAN3 for more realistic, alias-free generated images.
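A toy sketch of the seed idea, assuming the seed fixes the latent vector (the “person”) while the noise inputs are drawn fresh for every image. The sizes and function names are placeholders for illustration and no real generator is called.

```python
import random

Z_DIM = 512      # assumed latent vector size
NOISE_SIZE = 16  # toy stand-in for the generator's per-pixel noise maps


def latent_from_seed(seed):
    """The same seed always yields the same latent vector."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(Z_DIM)]


def sample_noise(rng):
    """Draw a fresh noise input; each call advances the generator state."""
    return [rng.gauss(0.0, 1.0) for _ in range(NOISE_SIZE)]


seed = 42
noise_rng = random.Random(0)  # separate stream, so noise differs between images

# Eight variants: identical latent, different noise each time.
variants = [(latent_from_seed(seed), sample_noise(noise_rng)) for _ in range(8)]

same_latent = all(z == variants[0][0] for z, _ in variants)
print("same latent across variants:", same_latent)
print("noise differs:", variants[0][1] != variants[1][1])
```

In a real generator each (latent, noise) pair would be rendered to an image, giving the same identity with small differences in details such as hair strands or freckles.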
Karras, Tero, Samuli Laine, and Timo Aila. "A style-based generator architecture for generative adversarial networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
Why we chose StyleGAN:
- Pretrained facial model
- Stylistic control
- Fast(ish) generative time
- Highest quality images (so far)
- Largest usable images (so far)
- ADA method reduces the training data needed
- Style mixing ability
3.3 Augmenting Our Dataset
Style-Mixing New Humans
The human genome diversified as generations of parents came together to create new variations through sexual reproduction. Our approach to diversifying machine learning datasets is similar: we mix two similar or dissimilar parents together to create new variations that are just realistic enough to trick A.I. algorithms into thinking they’re new human faces, making those algorithms more capable and robust when deployed to detect a wider variety of faces. This idea isn’t anything new; Minui Hong and colleagues have proposed a similar method of dataset augmentation, using style mixing to create new faces, cancer cell shapes or even pictures of cats.