Personal images of Australian children were used to train AI models without their consent, despite platforms prohibiting web scraping and families using strict privacy settings, according to a report by Human Rights Watch (HRW). The weblinks in the dataset even revealed details about the children, including their names and the locations where the pictures were taken.
HRW found about 190 photos of children from across Australia, including indigenous children who may be especially vulnerable, in the data used to train AI models. This follows an earlier HRW report which said 170 photos of Brazilian children had been found in LAION-5B, a popular AI training dataset built from Common Crawl snapshots of the public web.
Researcher Hye Jung Han noted that the 190 photos make up only 0.0001 percent of the 5.85 billion images and captions in the dataset. She added that these photos had been scraped “without the knowledge or consent of the children or their families,” and spanned their whole childhood.
The report also shared that “information about these children does not appear to exist anywhere else on the Internet,” showing that their families had been particularly cautious to protect the children’s identity online.
In one case, Han found that a YouTube video featuring two boys had been unlisted so as not to appear in searches, yet it was still part of the dataset.
A YouTube representative told Ars Technica that the company has been “clear that the unauthorised scraping of YouTube content is a violation of our Terms of Service, and we continue to take action against this type of abuse.” But given the images’ presence in the dataset, it is likely that AI tools have already been trained on this content.
LAION, a nonprofit that builds AI datasets, has been working with HRW to remove flagged images, but the process has not been a fast one.
Published - July 03, 2024 04:15 pm IST