
More than 1,000 known instances of child sexual abuse material (CSAM) were found in a large open dataset—known as LAION-5B—that was used to train popular text-to-image generators such as Stable Diffusion, Stanford Internet Observatory (SIO) researcher David Thiel revealed on Wednesday.

SIO’s report seems to confirm rumors swirling on the Internet since 2022 that LAION-5B included illegal images, Bloomberg reported. In an email to Ars, Thiel warned that “the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content.”

Thiel began his research in September after discovering in June that AI image generators were being used to create thousands of fake but realistic AI child sex images rapidly spreading on the dark web. His goal was to find out what role CSAM may play in the training process of AI models powering the image generators spouting this illicit content.
