This statement was last updated on .
Introduction
The past few months have launched generative AI into the public eye, and everyone seems to have a take on it. Generative AI models such as large language models (LLMs) and AI art generators consume vast amounts of aggregated content, determine statistical similarities within that content, and, when prompted, produce statistically likely, plausible-seeming output.
The current state of generative AI is environmentally disastrous and built on the back of labor exploitation, particularly in the Global South. Large language models' disregard for the truth is, at this point, well-documented.
Technology is not neutral. Leveraging and normalizing generative AI is not a neutral act.
On AI-Generated Content on This Site
I do not use generative AI models to write content on this site, nor do I use AI art generators to build graphics for the site. I also designed and implemented this site without the use of generative AI tooling.
If I do include AI-generated writing or art, it will be to illustrate a point about generative AI, and such use will be clearly marked.
There is one area where I have benefitted extensively from AI in content creation: authoring closed captions and transcripts for recordings of my past streams, using Descript. I vetted and cleaned up these AI-generated captions and transcripts afterwards to ensure their accuracy.
On Scraping and Training
Major generative AI models have thus far operated with a flagrant disregard for consent by treating web content at large as fair game.
Some large AI vendors have now publicized ways to opt your content out of their web crawlers, which would ideally ensure your content is not used in their training sets. That said, these opt-out methods strike me as too little too late, now that the vendors have already found great success with the content they had already scraped. Additionally, opt-out paradigms have a few objectionable shortcomings that make them ring hollow:
- Opt-out mechanisms benefit only a select few people who are aware such mechanisms exist, and who are technical enough and have enough ownership over their content's hosting to implement such mechanisms.
- Opt-out mechanisms assume a site is actively maintained and that someone is still around to provide consent. Abandoned sites and sites whose owners are deceased can't opt out, so apparently that counts as consent, right?
- Opting out is a Sisyphean game of whack-a-mole. Opt-out mechanisms rely on having faith that AI vendors will observe your preferences. Many of these mechanisms are vendor-specific, and very few vendors offer one at all, since each has to be specifically implemented by the vendor itself. Even if every vendor did, it would be unreasonably taxing to hunt them all down and opt out of each one.
- Opt-out mechanisms assume content hasn't been republished on more permissive sites. An unfortunate truth of the web is that web scrapers will often republish content wholesale, usually in an SEO grab. As Chris Coyier puts it, "All it takes is one scraper website that republishes the content and doesn't have an identical robots.txt and that's that."
To that end, I nevertheless endeavor to opt the content hosted on benmyers.dev out of being used to train models. Known web crawlers used for training AI models are disallowed in my robots.txt. While I don't believe it has much traction in the wild yet, I've also set up a machine-readable ai.txt file, using Spawning.ai's proposed format.
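As an illustration, disallowing AI training crawlers in a robots.txt might look something like the following sketch. The user-agent tokens shown are ones vendors have publicly documented (OpenAI's GPTBot, Google's Google-Extended, Common Crawl's CCBot); any list like this is necessarily incomplete and needs periodic updating as new crawlers appear.

```txt
# Sketch: block some widely documented AI training crawlers site-wide.
# This list is illustrative, not exhaustive.
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Rules for ordinary search crawlers can coexist alongside these entries; the blocks above only add opt-out signals that cooperating crawlers may choose to honor, which is exactly the faith-based limitation described earlier.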
Isn't Machine Learning… Learning?
Generative AI enthusiasts often argue that machine learning is akin to the way humans learn. After all, if a human can learn by drawing connections between experiences they've witnessed, and then create something of their own while drawing from those experiences as inspiration, why should AI be treated any differently? In my opinion, this fallaciously anthropomorphizes machine learning, reasoning about the model as though it were a new consciousness that just blinked into existence, absent the context of its creation. Instead, I'd offer an alternative mental model: generative AIs are tools, devised and deployed by corporations to operate at scale, laundering content the corporations generally do not own with the active intention of reducing the need for, and the value of, human labor. Whether machine learning is like human learning is irrelevant to the real-world impact of its use.