Licenses, “Signals” and genAI Scrapers

Note: As usual, this blog is the place where I publish personal musings and ramblings, a place to sketch things out, akin to thinking aloud. For this text, that is even more the case than usual: I wrote it down in a single session, and it has not yet been adorned with pretty illustrations.

The age of “AI” is upon us. And when people talk about “AI” nowadays, this can mean anything from the whole research field, to diverse flavours of Machine Learning, to “something magic that automates stuff”. Recently, it has mostly meant generative systems like LLMs, with all their side effects.

There’s a whole catalogue of baggage that comes with producing, disseminating and using generative systems, and enough people have written about it that I feel more than comfortable skipping right to the debate that popped up yesterday when Creative Commons released their CC Signals proposal. To my mind, the thinking behind those licenses is deeply flawed and gnaws at the very foundations of the idea of Digital Commons in more than one instance. Of all the groups of humans onto which the costs of training genAI models are externalized, it solely addresses creators, leaving everybody else by the wayside. I’d rather bin the whole idea, start from scratch, actually sketch out an idea of a world we’d like to live in, and work from there.

However, I was intrigued by the critiques that have popped up since yesterday, because they head in rather different directions and sometimes seem to accept power imbalances and societal constructs as immutable givens. The discussion is, of course, still ongoing. Nonetheless, I felt I should write down what popped into my mind after diving into the proposal.

What are CC Signals?

That proposal – and the discussion that followed on Mastodon and in the GitHub issues – seems to center around one specific side effect: the practice of scraping huge swaths of the Web to gather material that is then used to train generative models. Be it text for LLMs, imagery and videos for image generators, or all of the above for hybrid models.

People who actually produced the works that are now being scraped to build generative models have been… rather unamused by this practice. Creative Commons (the organization) appears to have been compelled by this sentiment to propose so-called Signals that are independent of the existing CC Public Licenses. They may be used in conjunction with the CCPL, but do not need to be, as far as I can tell from the writeups.

This kind of makes sense, since one open question is whether traditional licenses such as the CCPL even make a legal difference when it comes to whether a third party may scrape and analyse corpora on the Web. More on that later.

The proposed Signals have four dimensions which I will try to summarize here as closely as possible:

  • Credit: This signals an expectation that the provenance of source material is credited (“appropriate credit based on the method, means and context of your use”)
  • Direct Contribution: This signals an expectation of “monetary or in-kind support to the Declaring Party”, “based on a good faith valuation” (of the source material and the effort to maintain it)
  • Ecosystem Contribution: As with the Direct Contribution, but this time providing monetary or in-kind support to “the ecosystem”. The expectation outlined in the accompanying writeup (PDF) is that this shall “support the commons as a whole”
  • Open: This sets the expectation that the resulting “AI System” shall be “Open” (intentional scare quotes). The accompanying writeup speaks of “reciprocity”, and the ways of satisfying this requirement are explicitly listed as adhering to the OSAID or the MOF Class II or Class I

Much could be written about the creeping re-definition of the term “Open” alone. First through the introduction of the term (“AI systems released under free and open-source licences”) into the AI Act (see also Recital 102, Recital 103, Recital 104 and Recital 107 of the AI Act). Then through the OSAID and the MOF. And now through CC explicitly referencing the OSAID and MOF as “open”. And all of this probably goes back even further.

Alas, we’re navigating a hype-driven water park where red herrings abound. In my opinion, each and every one of the proposed Signals drags along a whole tail of consequences that might work against a society in which we freely share resources. So let’s dissect this piece by piece.

“AI” is more than generative systems

Just to get this out of the way and keep it in mind: When we’re trying to regulate or deal with “AI” in a legal sense, this encompasses more than just generative systems. The AI Act seems to explicitly include symbolic AI systems in its scope (Article 3, Recital 12).

I have not seen any definition within the CC Signals proposal of what is actually meant by “AI”. The language suggests that it is aimed at trained models, but it also announces categories such as TDM, training and inference. This might very well encompass long-standing practices of scraping websites, building a formal knowledge base from the text on a site, and using something as old-school as Prolog for inference. Just to set one’s environment variables, one might say.
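
To make that concrete, here is a toy sketch (in Python rather than Prolog, with entirely made-up facts and a made-up rule) of the old-school pipeline just described: scraped statements land in a symbolic knowledge base, and a hand-written rule performs the “inference”. If a definition hinges on categories like TDM, training and inference, even something this mundane arguably checks the boxes.

    # A toy symbolic knowledge base, as one might build it from scraped text.
    # All facts and names here are hypothetical illustrations.
    facts = {
        ("blogpost_42", "written_by", "alice"),
        ("alice", "member_of", "osm_community"),
        ("blogpost_43", "written_by", "bob"),
    }

    def infer_affiliations(facts):
        """One hand-written rule: if a work is written_by an author and that
        author is member_of a group, conclude the work is affiliated_with it."""
        derived = set()
        for (work, p1, author) in facts:
            for (author2, p2, group) in facts:
                if p1 == "written_by" and p2 == "member_of" and author == author2:
                    derived.add((work, "affiliated_with", group))
        return facts | derived

    print(infer_affiliations(facts))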

Why Signals anyway? What’s with Licenses?

One question I’ve seen thrown around a bit since yesterday goes along the lines of “why not create licenses that somehow regulate re-use for genAI training?”

Part of this question was already answered some two years ago by Creative Commons in a blog post titled “Understanding CC Licenses and Generative AI”. The gist: all the Open Cultural Licenses we have come to use and like are based on copyright. They are applied to creative works, be that a piece of software code, this very blog post, a photo or video somebody made, a song, a painting… the list goes on. Under current copyright legislation, each of these creative works has the label “all rights reserved” slapped onto it. If somebody wants to reuse such a creative work in a way that requires permission under copyright, Open Cultural Licenses have been a way of granting everybody a standardized license with the right to reuse under certain conditions. Naming the original creator, linking to the terms of the license? Those are clauses of many of the licenses that any re-user has to fulfill; otherwise the license is void.

The important part is the bit about reuse in a way that requires permission under copyright. Copyright has expanded… quite a bit over the past century. Often, this has been framed as a service to creators, who shall live off their creative works. It is worth keeping in mind, though, that the actual beneficiaries of ever-expanding copyright are usually the publishers. As Lawrence Lessig put it in 2003: “The publishers, such as the recording industry or the movie industry, aren’t so much defending the rights of creators, they’re defending a certain business model.”

Despite that “feature creep” of copyright legislation, there have always been niches carved out that are not covered by it, in the form of limitations and exceptions – and that has been a good thing so far. Be it quotations, accessibility, research, private copies, or freedom of panorama: within the EU, they all rely on those limitations and exceptions. Current efforts towards a right to repair seek to carve out new niches like this.

And, lastly, we have seen an exception for Text and Data Mining (TDM) within the EU. Up to that point, actors who scraped resources on the Web for analysis, or to provide better access to factual information (decoupled from protected creative expression!) within those resources, ran the risk of being dragged to court. Open Data activists were subjected to lawsuits for re-arranging information that is publicly available on the Web as recently as 2021. Legally, this last case is a bit different from TDM in the narrower sense, but I think it serves as an example of how copyright legislation has been preventing us as a society from accessing and re-using information that ought, in not only my opinion, to be freely reusable by anybody.

Coming back to the point: be it the TDM exception in the EU or Fair Use in the USA, there is reason to believe that those exceptions might apply to scraping and using information on the Web for anything “AI”, including training generative AI models. Which means that this might be reuse that does not require permission under copyright. And therefore, any restriction within Open Cultural Licenses, such as crediting a source, Share-Alike clauses or the dreaded Non-Commercial clause of the CCPL, has no bite.

It is worth keeping in mind that Open Cultural Licenses have been, and continue to be, a “hack” of IP legislation. IP legislation and copyright assume a society in which creative works are meant to be licensed in exchange for currency or goods, and even need to be sold in order to meet one’s basic needs.
I have always understood the idea of Digital Commons to be an example in which we – or, more accurately, a bourgeois subset of society that can actually afford to do so and is not dependent on income from its creativity – provide creative works as Commons to the whole of society, in order to demonstrate what a world could look like in which anybody can reuse such works for any purpose, and how sharing and collectively caring for the Commons could be an alternative to established practices which we have come to take for granted and natural.
But more on that later.

I see your TDM exception and raise you the TDM Opt-Out

Of course, I cannot just mention the TDM exception without mentioning the rights reservation mechanism that was introduced alongside it. Article 4 (3) DSMCD states that the exception applies on condition that “the use of works and other subject matter referred to in that paragraph has not been expressly reserved by their rightholders in an appropriate manner, such as machine-readable means in the case of content made publicly available online.”
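
For illustration, here is a minimal sketch (Python standard library only) of one common candidate for such “machine-readable means” in practice today: a robots.txt file addressing crawler user agents. Whether this actually satisfies Article 4 (3) is precisely the open legal question; the site URL is a placeholder, while GPTBot and CCBot are the documented crawler agents of OpenAI and Common Crawl.

    # Checking a robots.txt-based reservation, one candidate for the
    # "machine-readable means" of Article 4(3). Whether this has any legal
    # effect is exactly the open question discussed above.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.org/robots.txt")  # placeholder site
    rp.read()

    # A well-behaved scraper would check this before fetching anything
    # for training purposes.
    for agent in ("GPTBot", "CCBot", "*"):
        print(agent, rp.can_fetch(agent, "https://example.org/some-article"))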

What this rights reservation could look like in practice is part of an ongoing discussion. There has been an… original court decision that discusses how any human-readable declaration might also be machine-readable – because, the court reasoned, if those AI systems are as powerful as their vendors claim, they should have no problem at all recognizing a rights reservation that has been scribbled onto a napkin, right? (This is yet another rabbit hole I do not have the time to dive into for this article. Unfortunately or fortunately :D)

Relevant for the Signals proposal: The stance of Creative Commons from 2021 is that the CC licenses “cannot be construed or interpreted as a reservation of a right in the context of Article 4 of the EU Directive on Copyright in the Digital Single Market (EU 2019/790) or any of its national transposition instruments.”

One could, of course, craft a new version of the CCPL that includes a module for the TDM opt-out. I have no idea what effect this would have on fair use or any other exception outside the EU. Any such clause would, however, opt out of TDM for any purpose, with the exception of research “on a not-for-profit basis or in the context of a public-interest mission recognized by the State” (Recital 12 of the DSMCD).

Careful consideration would be necessary to avoid “accidentally” crafting a license regime that corporations could use to flat-out forbid civic tech activists from scraping and republishing their data in order to facilitate better access to information. And this goes right to the heart of the dilemma: all the proposals I’ve seen so far that seek to use IP legislation to rein in genAI scraping are trying to use a legal tool that traditionally gives far more leverage to the wealthy and powerful.

Adversarial models: How to shoot yourself in the foot, online.

After having set the scene, let’s finally look at the proposed Signals themselves. What I like to do in such cases is to Monkey’s Paw the heck out of them: let’s assume all of this is in place and construct some undesired consequences that come to mind.

Credit and where it might be due

We have become accustomed to crediting when using creative works that have been released under Open Cultural Licenses. Kudos at this point to all the licenses: They have been able to shape a cultural assumption that this is the way it ought to be!

However, when creating joint works that incorporate many sources – especially datasets – this has proven tedious to implement in practice. OpenStreetMap only incorporates data that comes with a waiver allowing it to credit contributions on an external page, not directly below the slippy map. And it might be rather obvious why this is the case.

The actual benefit of this attribution practice can also be called into question in certain circumstances. If you purchase a Smart Watch, or a modern automobile, or any sufficiently “smart” household appliance, chances are high that that product uses Free and Open Source Software that requires the vendor to deliver the respective license and attribution to you.

You might feel like you’re having a slow weekend and finally read all those licenses and attributions. Tapping through the display on your blender, or scrolling down the in-vehicle screen of a car, you are free to review the sometimes thousands of pages that somebody took pains to compile (and keep up to date) in order to adhere to the license. I have yet to find out how this strengthens the F/LOSS ecosystem, but here we are.

Things get even murkier when we talk about TDM in order to, say, craft knowledge bases for symbolic systems. Let’s say I have done that and now perform SPARQL queries on the end result. The sources of which parts of the graph do I have to display in order to credit appropriately?
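
To make the bookkeeping concrete, here is a deliberately naive sketch with hypothetical facts and URLs: every statement carries its source, and even a single small query pattern already touches several of them. In a real RDF setup one would reach for SPARQL named graphs instead, but the granularity question stays the same.

    # A deliberately naive quad store: each statement carries its source.
    # All facts and URLs are made up for illustration.
    quads = [
        ("berlin",  "population", "3900000", "https://example.org/stats"),
        ("berlin",  "capital_of", "germany", "https://example.net/geo"),
        ("germany", "member_of",  "eu",      "https://example.com/politics"),
    ]

    def match(subject=None, predicate=None, obj=None):
        """Return matching statements plus the set of sources they came from."""
        hits = [q for q in quads
                if subject in (None, q[0])
                and predicate in (None, q[1])
                and obj in (None, q[2])]
        return hits, {src for (_, _, _, src) in hits}

    # "Everything about berlin" already draws on two distinct sources:
    statements, sources = match(subject="berlin")
    print(sources)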

I can, to some degree, see the point when an LLM uses RAG for its output, as is mentioned in CC’s PDF writeup. That is, however, decoupled from the TDM and training stages that the Signals proposal claims to address. (This part also brings up memories from way back when some actors claimed that indexing the Web for search engines constitutes copyright violation – yet another rabbit hole.)

However, citing a source when using RAG kind of re-frames the whole reasoning behind why one should cite a source in the first place. In the case of re-using actual creative works, citing is based on the author’s moral rights, which are secured by copyright. We might be operating under the assumption that there is still a moral right to have my name cited if the text I produce here ends up as a mere drop in the bucket of some statistical analysis (again, let’s not get distracted by laser-focusing on genAI; let’s look at the general mechanism).

On the other hand, if I cite a source in making an argument, or if a scientist cites sources in an analysis, neither they nor I do that because it is a commonly accepted moral obligation. Instead, we cite sources because it is a method of underpinning the argument one tries to make. It enables others to reproduce the line of thinking and elevates a mere statement to something that could be knowledge.

There is also a whole load of practical considerations tied to the Credit Signal. From link rot to maintaining the credit repository to source disambiguation: even in everyday science applications, it is not always possible to link back to a canonical source, even for a document that one “knows where it came from”. Mix in different steps of aggregation and you’ll end up with some form of crediting that is of no use to the “Declaring Party” and just drags along a cumbersome metadata catalogue.

All of this looks to me like a mere distraction. It does not solve anything that I can identify and invites bike-shedding discussions on how to actually implement it. This time could be better spent.

Direct and Ecosystem Contributions: Turning Commons into Commodities and introducing false incentives

I must confess, I find it very hard to wrap my head around the Direct and Ecosystem Contribution Signals. Even more so considering that this proposal comes from Creative Commons. CC claims that this is some kind of “reciprocity” and that “A Thriving Commons Requires Reciprocity”.

I am tempted to slap a big [CITATION NEEDED] onto every one of these statements. The PDF mentions that the Commons are based on an assumption “that we are all in this together”, that “knowledge and creativity are building blocks of our culture, rather than just commodities from which to extract value,” and that the social concept of reciprocity that CC proposes to introduce “is not transactional”.

How one can start from this point and end up with not one but two Signals that are very, very much based on transactions, remunerating one side of the transaction monetarily or in kind, and going so far as to try to put a value on the “good” being exchanged, is beyond me.

My first reaction to this was pretty much the same one I had when first thinking about the TDM opt-out mechanism and the entryway it opens to monetizing stuff on the Web: cool, now we have constructed an incentive to flood the Web with even more slop.

I believe that any financial incentive will be gamed. Maybe this will end up choking genAI vendors, maybe it won’t. But I can’t see how “we”, the ordinary denizens of the Web, won’t bear the brunt of the whole Web getting even more enshittified.

But let’s assume this were to actually work, in the way it seems to have been intended. What could be the outcomes?

Releasing creative works to the public under CC licenses has, so far, been mostly an endeavour for people who can afford to do so. One might assume that they are at the wealthier end of the global spectrum – be it in money or in free time (the one usually being exchangeable for the other). The proposed Signals open the possibility for these people to get some remuneration out of it. Nice, right? Let’s put aside for a moment (sarcasm marker!) the issue that neither the click workers in the Global South who are exploited in the production process of genAI, nor the infrastructure administrators onto whom the costs of constantly scraping the Web for ever more genAI training are externalized, get anything out of this proposal.

So, ignoring those pesky details, we do have some redistribution of funds between the genAI vendors (assuming they adhere to the Signals) and some of the more wealthy people, on a global scale. There might also be some redistribution between some genAI vendors and “The Ecosystem™”. Even nicer, right?

Just one question:

How?

What is the idea of how this would work in practice? That’s a valid question, right? And I truly believe it pays to think this idea through to the end.
Does anybody remember micropayments? Like Flattr? Feels like a blast from the past, huh? Or, wait, maybe that is not the right approach. What other alternatives already do this kind of distribution? Oh, right. Collecting societies. You know, GEMA, MPLC, all those names that sound rather familiar to anybody who followed the copyright wars over the past, let’s say, two to three decades.

Or, of course, we could think about just creating our own collecting society! Independent from the baggage of the others! Something nobody has considered before… apart from, say, the C3S, which was proposed so long ago that the blog post announcing its launch has already been offline for over ten years. Others might remember the Kulturwertmark (de) proposed by the Chaos Computer Club in 2011.

This all feels like reinventing a wheel that has already been discussed in far more depth, without analyzing the insights from back then. Some already privileged people might benefit from the mechanism behind it. Others, who are also exploited in the process of genAI training, would not. It is unclear how this is supposed to work in practice. And we might gravitate towards creating new structures for it that follow the form and shape of existing structures which have not, up to now, proven beneficial to the idea of the Commons.

Open: Reframing “Openness” beyond recognition

Lastly, the “Open” Signal. Le sigh.

Much has been written on the ongoing reframing of the term “Open” when it comes to genAI models. See, for instance, “Open (For Business): Big Tech, Concentrated Power, and the Political Economy of Open AI”, or “Rethinking open source generative AI: open-washing and the EU AI Act”. Other takes seem to have no qualms applying the term to constructs like “you can download a trained model and fiddle around a bit with its weights”.

Gone seem to be the ideas that this might all be about power structures and how to rebalance them. Granted, yes, I know, I know, even ideas like Free and Open Source Software have never been able to actually seize and distribute the Means of Production to the actual “masses”. Just like with Open Cultural Works, this has so far mostly been a playground for a privileged few.

One might argue that even this limited redistribution already fulfills at least the first steps towards another vision of how we produce things as a society; that it makes another mode of production thinkable; that not all of society needs to partake in this production for the results to be available to potentially everybody, without gatekeepers in the way. One has to remain optimistic, right? See the strange reciprocity argument: even if 99.99% of all people just use Open Cultural Works in some way (reading Wikipedia, using OpenStreetMap, etc.) without ever contributing back, that’s fine as long as there are enough other people contributing to and maintaining the Commons without expectation of remuneration.

However, the main point for me is that this is at least somehow about shifting power dynamics. And it looks like, while “Openness” is in danger of becoming yet another Magic Concept, or has already become one, the popularization of “Openness” has gone hand in hand with gutting anything power-related from the term.

Personally, I find the OSAID and the MOF (maybe excluding Class I) highly problematic and regressive. At most, many of the systems adhering to these definitions might have been called “freeware”, you know, back in the day. And even if those models were published in a way that could be replicated by anybody: even among the select few who can compile their own Linux kernel today, only the tiniest fraction would be GPU-rich and electricity-rich enough to actually recreate any generative AI system even theoretically. And then we’re at the question of whether we’d actually want even more actors going around scraping the Web at scale, putting even more external costs on the people running, you know, the actual infrastructure we have come to rely on. And now we’re at the point of asking what this “Openness” is even supposed to mean anymore.

What I found very interesting is the question of how the enclosure of the produced models even works in practice. Assuming a model and its parameters have been published somewhere – regardless of whether on purpose or “by accident” – how could a vendor slap a license on it and expect me to adhere to it? “The Mirage of Artificial Intelligence Terms of Use Restrictions” is one paper that dives into this question; not surprisingly, so did the IPO back in 2020. I had a hunch back in February that turned out to be right: the most likely framework within which any license could apply to a trained generative model is… the EU Sui Generis Right for Database Creators. Because training a model likely constitutes a substantial investment, and the SGDR is supposed to protect that investment.

So we have a situation in which scraping the Web for training purposes might not be subject to copyright law, but the resulting product is. Based on a controversial piece of legislation that has stood in the way of access to knowledge for everybody time and again. I’m not saying that this would solve anything, but maybe, just maybe, advocating for reforming the SGDR might be a better idea than fiddling around with ever-shoddier definitions of “Openness”.

Other Side Effects and introducing unwanted expectations

This has become far too long already, but some points should still be considered. This whole proposal looks to me as if it raises expectations that all this genAI stuff can be dealt with by just expanding on licenses, copyright and existing structures. It seems to normalize expectations of “crediting” that have proven to be a fractal broccoli of practical problems, and of a moral right to determine not only how creative expressions (which are subject to copyright law) may be re-used, but also how the factual tidbits embedded in them, or statistical analyses of them, may be. As described by Denny Vrandečić over a decade ago, popularizing such notions can lead to more restrictions on re-using data. And those notions are received not only by you and me, who might be fed up with the current genAI hype, but also by, say, public institutions looking for a way out of providing information to the public in re-usable form.

Also, even starting to think about how any of this could be enforced in practice dishes out yet another fractal broccoli. How would one identify somebody who does not adhere to a Signal, or to an opt-out clause? Maybe, with some massaging, one can make some trained model spit out a paragraph that looks like a direct lift from one’s work. But how did it end up there, especially in the age of “synthetic datasets” (i.e., building a Human Centipede AI)? Even if synthetic datasets were not a thing: let’s say I see crawlers trawling my servers. Looking at how they try to hide behind residential IP ranges, what is supposed to happen then? Are we now arguing for Data Retention to get hold of the perpetrators?
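
As a thought experiment, here is a sketch of the “easy” half of enforcement: scanning a web server access log for self-declared AI crawlers. The log path and the common “combined” log format (user agent as the last quoted field) are assumptions, and the agent list is illustrative, not exhaustive. The point is that this only ever catches the honest ones.

    # Scan an access log for self-declared AI crawlers. Log path and format
    # are assumptions; the agent list is illustrative, not exhaustive.
    import re

    KNOWN_AI_AGENTS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

    def declared_ai_hits(logfile="access.log"):
        hits = []
        with open(logfile) as f:
            for line in f:
                # In the combined format, the user agent is the last quoted field.
                ua = re.search(r'"([^"]*)"\s*$', line)
                if ua and any(agent in ua.group(1) for agent in KNOWN_AI_AGENTS):
                    hits.append(line.rstrip())
        return hits

    # A crawler hiding behind residential IPs with a browser user agent
    # produces lines indistinguishable from human visitors, which is exactly
    # the enforcement problem described above.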

Don’t get me wrong: I am absolutely not arguing that the Web ought to be a free-for-all for genAI vendors. Not in the slightest. The point I am trying to make is that the approach I see in the Signals proposal has the potential to roll back the clock on the achievements we have made towards a sharing culture where we end up with more for everybody, and towards a Web that is not just yet another commercial marketplace. I fail to see any vision in the proposal of what a future world based on Digital Commons could look like. It looks more like taking incremental steps, starting from the status quo and treating all the constructs already in place as given, normal and the way they ought to be. Simultaneously turning the Commons into Commodities, for whatever reason.
And, frighteningly, in more than one place, people are now starting to argue, explicitly or at least in consequence, for measures that have been considered “the weapon of the enemy” for decades.

We could instead talk about how the costs that are currently being externalized – onto creators, but also onto click workers and infrastructure maintainers – could be internalized for the genAI vendors. The Signals proposal only addresses creators, and in an oddly skewed way. Maybe we should talk more about taxation. Or about other means that redistribute the current gains of genAI vendors – if there even are any; who knows, maybe it’s all just about capitalization – not only to creators, but to everybody. Which might lead to mundane things like school toilets being fixed. Or other goals we might like to have as a society.

Automation will create wealth, but under current economic conditions much of that wealth will flow to the owners of the automated systems, leading to increased income inequality.

Russell and Norvig. Artificial intelligence: a modern approach.

Then, of course, we would have to not just swim with the hype current and tack some new wallpaper onto crumbling walls, but try to shift the conversation about genAI in general. How mechanisms for redistributing the gains and internalizing the costs might be a desirable way forward, even if they stand in the way of unshackled genAI futurism. Questioning whether unshackled genAI futurism is desirable in itself, instead of taking it as an immutable and logical way forward. How merely fine-tuning the sham “Openness” dial or sprinkling some magic “for the common good” spices over that stew just doesn’t cut it.

And, not least, what a way towards more Commons, more sharing, more wealth for everybody could look like. I don’t know, maybe enabling way more people to have enough free time that they need to spend it neither at work nor on flooding the Web with slop to squeeze a few pennies out of genAI scrapers just to keep a roof over their head and enough to eat, might be an interesting idea. You know, just taking this idea of the Commons and looking at how it could be more than a luxury endeavor for a select few.

(Current further reading: The Signpost from June 24th reviews recent research yet again. A valuable resource that tends to fly under the radar far too often.)