AI Update: Tiny Images - A Good Example of the “Bad” AI the EU Wants to Regulate
28 July 2020
It was recently reported that MIT had deleted its much-used and often-cited Tiny Images dataset. It appears that MIT has also asked researchers and developers to stop using the dataset to train artificial intelligence and machine learning systems. Researchers at University College Dublin and UnifyID, an authentication startup, found racist, misogynistic, and demeaning labels among the nearly 80 million pictures in Tiny Images. The UCD researchers had conducted an “ethical audit” of several large vision datasets, each containing many millions of images. They focused on Tiny Images as an example of how social bias proliferates in machine learning.
The offending dataset was built on an older resource: WordNet, a database of word relationships created by psychologists and linguists at Princeton in 1985. Scientists at MIT compiled Tiny Images in 2006 by searching the internet for images associated with words in WordNet. Because WordNet includes racial and gender-based slurs, Tiny Images collected photos labelled with such terms.
The UCD/UnifyID researchers are quoted as saying:
“Not only is it unacceptable to label people’s images with offensive terms without their awareness and consent, training and validating AI systems with such dataset raises grave problems in the age of ubiquitous AI. When such systems are deployed into the real-world, in security, hiring, or policing systems, the consequences are dire, resulting in individuals being denied opportunities or labelled as a criminal. More fundamentally, the practice of labelling a person based on their appearance risks reviving the long discredited pseudoscientific practice of physiognomy.”
The EU’s Plans to Regulate AI
The High Level Expert Group tasked by the EU to plot the bloc’s approach to investing in and regulating AI has already recommended a number of requirements that would apply to the use of flawed datasets such as Tiny Images. If and when these proposals for the regulation of AI are agreed, they could include:
Requirements ensuring that AI systems are trained on data sets that are sufficiently broad and cover all relevant scenarios needed to avoid dangerous situations;
Requirements to take reasonable measures aimed at ensuring that use of AI systems does not lead to outcomes entailing prohibited discrimination. These requirements could entail in particular obligations to use data sets that are sufficiently representative, especially to ensure that all relevant dimensions of gender, ethnicity and other possible grounds of prohibited discrimination are appropriately reflected in those data sets;
Obligations to maintain accurate records regarding the data set used to train and test the AI systems, including a description of the main characteristics and how the data set was selected and, in certain justified cases, the data sets themselves; and
Obligations to maintain documentation on the programming and training methodologies, processes and techniques used to build, test and validate the AI systems, including where relevant in respect of safety and avoiding bias that could lead to prohibited discrimination.
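For development teams wondering what the record-keeping and representativeness points above might look like in practice, the following is a minimal sketch in Python. It is illustrative only and not a statement of what any regulation will require. The `DatasetRecord` structure, the `audit_labels` and `underrepresented` helpers, the blocklist and the `minimum_share` threshold are all hypothetical names and assumptions introduced here. The sketch keeps a simple record of how a data set was sourced and selected, counts its labels, flags any label that appears on a reviewer-supplied list of unacceptable terms, and lists labels that make up less than a chosen share of the data.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DatasetRecord:
    """Hypothetical record of a data set's main characteristics."""
    name: str
    source: str
    selection_method: str
    label_counts: dict = field(default_factory=dict)
    flagged_labels: list = field(default_factory=list)


def audit_labels(name, source, selection_method, labels, blocklist):
    """Count labels and flag any that appear on a reviewer-supplied blocklist."""
    # Normalise case and whitespace so "Cat " and "cat" are counted together.
    counts = Counter(label.strip().lower() for label in labels)
    blocked = {term.strip().lower() for term in blocklist}
    flagged = sorted(label for label in counts if label in blocked)
    return DatasetRecord(name, source, selection_method, dict(counts), flagged)


def underrepresented(record, minimum_share):
    """Return labels whose share of the data set is below minimum_share."""
    total = sum(record.label_counts.values())
    if total == 0:
        return []
    return sorted(
        label
        for label, count in record.label_counts.items()
        if count / total < minimum_share
    )


# Illustrative usage with made-up labels (assumption: one label per image).
record = audit_labels(
    name="example-images",
    source="web search (illustrative)",
    selection_method="one query per vocabulary word",
    labels=["cat", "dog", "Cat", "offensive_term", "cat",
            "dog", "cat", "cat", "cat", "cat"],
    blocklist=["offensive_term"],
)
print(record.flagged_labels)
print(underrepresented(record, minimum_share=0.15))
```

A check of this kind only catches terms that someone has already identified as unacceptable, so it would complement, rather than replace, the sort of human-led ethical audit the UCD researchers carried out.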
Why this is important
There is a growing push for companies and developers to bake ethics into their AI and ML processes and procedures. Many tech companies are already doing this. With the EU proposals for the regulation of AI, however, it is likely that in the near future AI and ML based products could be refused access to the lucrative EU market unless their makers first prove that the products meet the proposed EU standards. Based on the current proposals, any AI/ML products trained or tested on the Tiny Images dataset could be excluded!
The content of this article is provided for information purposes only and does not constitute legal or other advice.