GPLed code on github (given the copilot controversy)

Tue Jul 13 12:09:55 UTC 2021

On Monday, 12 July 2021 23:16:22 CEST marc wrote:
> Hi, me again
> 
> So I am going to respond to multiple comments in one go:
> 
> I had a look at Julia Reda's post, and as far as I can
> make out, she only focuses on the fact that individual
> snippets are very short - but doesn't make any mention that
> inserting *lots* of snippets algorithmically is *all* that
> copilot does...

I imagine that the thinking at GitHub here is that if anyone's code is copied 
verbatim into something else, those people won't be able to assert their 
copyright based on some kind of "lack of standing", even if a lot of other 
people's code is also copied into the final work. Given that this kind of 
defense worked for actual, substantial, alleged infringement of the Linux 
kernel code by VMware, as opposed to chaotic copy-pasting of code fragments, I 
can easily see Microsoft's lawyers feeling confident that GitHub can get away 
with this, especially if combined with wide-eyed "it's an artificial 
intelligence" nonsense.

(I really wish people would stop referring to the application of supposed 
artificial intelligence techniques as *an* artificial intelligence, especially 
since beyond the breathless hype, few of those people are likely to be 
bothered with any of the broader ethical considerations at stake, particularly 
when applications of artificial intelligence do become sophisticated enough to 
merit concern about matters like the autonomy of such systems themselves.)

> In a way her position is understandable - I believe it
> is consistent with that of the pirate party - they are
> copyright minimalists, and would prefer a world with no
> copyright, as far as I can tell. But until IP lawyers call
> themselves TSOALGGM lawyers (that expands to "temporary
> stewards of a limited government granted monopoly" rather
> than "intellectual property") I am not sure if her view
> is representative of the current situation.

One might agree with her assertion that stronger and more severe copyright 
laws are not necessarily helpful for Free Software, copyleft in particular, 
given that copyleft is effectively meant to subvert the copyright regime to 
promote the fair sharing of software. I also have to say that this thread is 
the first I've heard of this matter, and since I follow the FSF's 
announcements (and controversies) fairly actively, I wonder which "copyleft 
scene" she is referring to. Maybe a bunch of people who have shovelled their 
code onto GitHub because it was the popular thing to do, plus a bunch of 
people on proprietary "social media" platforms?

> Another commenter said that the codebase used to generate
> the model is just somehow the "input" and not actually *in*
> the model - but I am not sure the distinction is that clear. If
> I ROT13 a Metallica mp3, then there is an algorithmic
> transformation and the new file is clearly different, but it
> is possible to recover the original. In the same way it
> could be argued that the copilot model encodes the input
> code in its weightings. I suppose there are some losses,
> but if I were to downsample and ROT13 a Metallica
> CD (I don't, I have decided not to like their music), I'd
> still be in trouble if I'd claim it as my own work, right ?
> And if I XOR it with a Rick Astley mp3, would that suddenly
> be fair use ?
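
To make the reversibility point in the quoted analogy concrete, here is a
minimal Python sketch. It rests on a few assumptions: ROT13 has no standard
meaning for binary data, so it is applied here only to ASCII letter bytes;
the byte strings are stand-ins rather than real audio; and the helper names
rot13_bytes and xor_bytes are made up for illustration. It shows only that
both transformations can be undone, and says nothing about how the copilot
model actually stores its input.

```python
from itertools import cycle


def rot13_bytes(data: bytes) -> bytes:
    """Rotate ASCII letters by 13 places; leave every other byte untouched."""
    out = bytearray()
    for b in data:
        if 65 <= b <= 90:        # 'A'..'Z'
            out.append((b - 65 + 13) % 26 + 65)
        elif 97 <= b <= 122:     # 'a'..'z'
            out.append((b - 97 + 13) % 26 + 97)
        else:
            out.append(b)
    return bytes(out)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key; applying it twice restores the data."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


# Stand-in byte strings (assumptions for illustration, not real mp3 data).
original = b"ID3\x03 stand-in for the bytes of some audio file"
other = b"stand-in for a second, unrelated file"

scrambled = rot13_bytes(original)   # looks different from the original
mixed = xor_bytes(original, other)  # "mixed" with the second file

# Both transformations are lossless: the original comes straight back.
assert rot13_bytes(scrambled) == original
assert xor_bytes(mixed, other) == original
```

In both cases the output differs from the input byte for byte, yet anyone
holding the procedure (and, for XOR, the second file) recovers the original
exactly. A lossy step such as downsampling would break that exact round
trip, which is the part of the analogy this sketch does not cover.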

There are quite a few things in the commentary that other commentators have 
surely picked apart already, but even skimming the text provides quite a few 
eyebrow-raising moments. For instance, the revelation that merely reading a 
book does not infringe copyright might be worth repeating to, say, the music 
industry, but what copyright is all about is indicated by its name. And 
traditionally, copying information does not tend to include "copying" it into 
one's brain via the visual system and other cognitive processes.

However, it is easy to see the top of the slippery slope at this point. What 
if the software is "an artificial intelligence" (sigh) that is merely being 
trained? It is then easy to imagine that companies might want to have things 
both ways (as they do now, but that is another matter): their artificial 
minion does all the work and isn't infringing anyone's copyright, but the 
company gets to copyright the output. At the same time, they can plead that it 
is just a machine and, unlike a human, cannot knowingly plagiarise other 
people's works.

Another thing that stood out was this: "The output of a machine simply does 
not qualify for copyright protection – it is in the public domain." Although I 
recognise that within a specific context, it might be true, it certainly is 
not unquestionably true beyond that context. A compiler takes source code and 
produces object code, but that object code is not in the public domain. Having 
spent the last few months indulging my nostalgia and reading old computing 
publications, I am reminded of the outrage back in the 1980s when one company 
produced a compiler for a microcomputer and then claimed that the output was, 
at least in part, based on their original work:

"Softek compiler payments dispute"
https://archive.org/details/popular-computing-weekly-1983-05-26/mode/1up

Even if "computed" output is not subject to copyright, various inputs will be, 
and as the output is a translation of some input, it may be, too. In the 
Free Software movement, people are fairly careful about such precedents for 
good reasons. I say good luck to anyone wanting to test their legal theories 
by, let us say, publishing a machine-translated version of one of the Harry 
Potter books.

> Finally for more amusement value: A different conversation
> points out to me that I should have made my mail more
> click-baity: "Does copilot mean that microsoft has lost
> its license to distribute the linux kernel ?" I am still
> not sure - but maybe this bit of sensationalism makes it
> clearer what is at stake ?

One might argue that if the tool reproduces code fragments that are big enough 
to be considered candidates for copyright infringement, then, since they are 
source code fragments, the only obligation is to ensure that copyright and 
licensing information is also provided to the user of the tool. That possibly 
gets GitHub off the hook, but it then leaves the user of the tool to figure 
out what the status of the resulting work might be. Maybe it should also be 
generating a REUSE manifest to help that poor end-user.

Paul