Marketers are always seeking better content to engage visitors to their digital properties and move them along their journey. Generative AI—technologies like DALL-E for images and GPT-3 for text—has recently captured our imagination. It has also captured the attention of software creators and content publishers.
These technologies are spawning applications and use cases that make it easier for publishers, including marketers, to create content quickly and inexpensively.
I predict that many publishers—both in the media and among businesses—will soon replace human writers and visual artists with AI as their default content source. The great irony is that content created by people flawlessly trained their replacements. And no person got paid for that privilege.
Did AI make that ugly picture?
Yes. Yes it did. It makes you wonder how this tech will take over the world.
Generative AI starts the creative process with something called a “prompt”. It’s a bit of text that explains what you’re looking for. I gave DALL-E the prompt “A bank robber robbing a website of its content, digital art” and after ten seconds or so DALL-E presented four renderings that it thought matched my prompt.
That image didn’t quite catch my fancy, so I asked it to try again. It rendered the following as one of the options.
It could do this all day. You can refine the prompt yourself or simply ask DALL-E to guess again.
DALL-E can do this because it was trained on a bazillion (we really don’t know how many) captioned images. So it “knows” what a bank robber looks like, what a website looks like, and what content is. The engine uses that knowledge to interpret the prompt and render what’s requested.
If you want to see more examples of the types of prompts DALL-E can respond to, Dale Markowitz wrote a great article on the subject.
A key question is: Do they do this magic by stealing creators’ content?
Do generative AI engines steal?
Do these generative technologies really use someone else’s copyrighted content to create their magic?
We’re really not sure, but it seems borderline miraculous that all of this could be rendered from free content sources. From what we know, they don’t steal in the legal sense. For example, it has been reported that GPT-3 was trained on documents like Wikipedia that are freely available. So that’s not stealing.
It’s less clear how visual models like DALL-E are trained. Again, the reporting is that captioned images were used to train the models, but there have been no reports on exactly how that was done. If you assume millions and millions of images are required, there’s no Wikipedia equivalent for images. So where did all that data come from?
One way to get such content is to crawl websites, much as a search engine would. Web crawlers can hoover up content anywhere a browser can reach it. This may violate the terms of service of some websites and trample creators’ copyrights. In a sense, theft. But again, isn’t content on the internet there for the taking? Most content creators would say “No!”.
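Such a crawler doesn’t need to be sophisticated. As a rough, hypothetical sketch (using only Python’s standard-library `html.parser`; the page markup here is invented for illustration), harvesting image–caption pairs can be as simple as reading each `<img>` tag’s `alt` text:

```python
from html.parser import HTMLParser

class ImageCaptionScraper(HTMLParser):
    """Collect (src, alt) pairs -- the kind of image/caption data
    a training-set crawler could harvest from any public page."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # Keep only <img> tags that have both a source and a caption.
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt"):
                self.pairs.append((a["src"], a["alt"]))

# Invented sample markup standing in for a fetched web page.
page = '<p>My art</p><img src="robber.png" alt="A bank robber robbing a website">'
scraper = ImageCaptionScraper()
scraper.feed(page)
print(scraper.pairs)  # [('robber.png', 'A bank robber robbing a website')]
```

Scale that up across millions of pages and you have a caption-labeled image dataset, whatever the site’s terms of service say.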
Every content creator has at some time or another been approached with some version of the following offer: “Give us free content. You’ll get paid manyfold by the exposure you’re getting.” Con men.
They’ve also seen their content, if not a whole website, copied by some third party. If you hang out in content communities, you’ll hear horror stories.
You may have personal experience as a content thief. Who hasn’t grabbed an image from a Google image search for a presentation? Fair use or theft? We don’t know if that’s what’s going on at DALL-E.
So the figurative jury is out on whether text and visual AI tech is gathering unauthorized content to fuel its algorithms. Where things are more clear is in the area of generative code.
Stealing from programmers
One generative tech that’s currently getting scrutiny of the litigious sort is generative AI that assists programmers in creating code.
GitHub is a programming workflow environment used by the vast majority of software engineers. GitHub, a Microsoft subsidiary, recently launched a technology called Copilot. Copilot helps programmers by suggesting code as they type. Think of it as type-ahead for code. It finishes your sentences, but your sentences look like:
Notice how lines 1-3 in the example above are English-language prompts that are translated into code by Copilot. Pretty magical.
There’s one problem. The original training was done on publicly available source code, and many of those open-source repositories carry licenses that require attribution. Software engineers using Copilot noticed that some suggestions copied code verbatim from those sources. That sure seems like an easy-to-understand infringement of copyright.
The case brought against Microsoft and several other generative AI companies will begin to sort out how content owners can be made whole for the use of their content.
Whether your art is text, visual art, or even computer code, it is possible that today some generative AI technology is making use of that work for its own purposes. You may never know. You certainly will struggle to prove it unless you can find your “line of code” embedded in an AI-generated piece of content.
Here’s what’s important for you
As a marketer (or in any other business function), beware if you purchase creative assets generated by AI. You should be well informed about the source of the data used to train the AI employed in your business. If it can easily be proven that a work is derivative without proper attribution, you may be on the hook along with your vendor.
I will leave you with this: a 3D render of a penguin walking down the street.