GPLed code on github (given the copilot controversy)
Paul Boddie
paul at boddie.org.uk
Tue Jul 13 12:09:55 UTC 2021
On Monday, 12 July 2021 23:16:22 CEST marc wrote:
> Hi, me again
>
> So I am going to respond to multiple comments in one go:
>
> I had a look at Julia Reda's post, and as far as I can
> make out, she only focuses on the fact that individual
> snippets are very short - but doesn't make any mention that
> inserting *lots* of snippets algorithmically is *all* that
> copilot does...
I imagine that the thinking at GitHub here is that if anyone's code is copied
verbatim into something else, those people won't be able to assert their
copyright because of some kind of "lack of standing", even if a lot of other
people's code is also copied into the final work. Given that this kind of
defense worked for actual, substantial, alleged infringement of the Linux
kernel code by VMware, as opposed to chaotic copy-pasting of code fragments, I
can easily see Microsoft's lawyers feeling confident that GitHub can get away
with this, especially if combined with wide-eyed "it's an artificial
intelligence" nonsense.
(I really wish people would stop referring to the application of supposed
artificial intelligence techniques as *an* artificial intelligence, especially
since beyond the breathless hype, few of those people are likely to be
bothered with any of the broader ethical considerations at stake, particularly
when applications of artificial intelligence do become sophisticated enough to
merit concern about matters like the autonomy of such systems themselves.)
> In a way her position is understandable - I believe it
> is consistent with that of the pirate party - they are
> copyright minimalists, and would prefer a world with no
> copyright, as far as I can tell. But until IP lawyers call
> themselves TSOALGGM lawyers (that expands to "temporary
> stewards of a limited government granted monopoly" rather
> than "intellectual property") I am not sure if her view
> is representative of the current situation.
One might agree with her assertion that stronger and more severe copyright
laws are not necessarily helpful for Free Software, copyleft in particular,
given that copyleft is effectively meant to subvert the copyright regime to
promote the fair sharing of software. I also have to say that this thread is
the first I've heard of this matter, and since I follow the FSF's
announcements (and controversies) fairly actively, I wonder which "copyleft
scene" she is referring to. Maybe a bunch of people who have shovelled their
code onto GitHub because it was the popular thing to do, plus a bunch of
people on proprietary "social media" platforms?
> Another commenter said that the codebase used to generate
> the model is just somehow the "input" and not actually *in*
> the model - but I am not sure the distinction is that clear. If
> I ROT13 a Metallica mp3, then there is an algorithmic
> transformation and the new file is clearly different, but it
> is possible to recover the original. In the same way it
> could be argued that the copilot model encodes the input
> code in its weightings. I suppose there are some losses,
> but if I were to downsample and ROT13 a Metallica
> CD (I don't, I have decided not to like their music), I'd
> still be in trouble if I'd claim it as my own work, right ?
> And if I XOR it with a Rick Astley mp3, would that suddenly
> be fair use ?
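The reversibility point in the quoted analogy can be illustrated with a
minimal sketch. It makes several assumptions: ROT13 is only defined for
letters, so a byte-wise rotation by 13 (modulo 256) stands in for it here;
the function names rotate_bytes and xor_bytes are hypothetical; and short
byte strings stand in for real audio files. It does not model lossy
downsampling, and it says nothing about how the Copilot model actually
stores its training input.

```python
from itertools import cycle


def rotate_bytes(data: bytes, shift: int = 13) -> bytes:
    """Rotate every byte by `shift` modulo 256 (a byte-wise stand-in for ROT13)."""
    return bytes((b + shift) % 256 for b in data)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR `data` with `key`, repeating the key as often as needed."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# Placeholder data: these byte strings are assumptions, not real files.
original = b"stand-in for the bytes of an audio file"
second_file = b"stand-in for a second audio file"

# The rotated bytes differ from the original, yet the inverse shift
# recovers the original exactly.
rotated = rotate_bytes(original)
assert rotated != original
assert rotate_bytes(rotated, -13) == original

# The same holds for XOR: applying the same key a second time undoes it,
# so anyone holding the second file can recover the first.
mixed = xor_bytes(original, second_file)
assert mixed != original
assert xor_bytes(mixed, second_file) == original
```

Both transformations are lossless, so the transformed data still determines
the original completely, which is the point of the analogy.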
There are quite a few things in the commentary that other commentators have
surely picked apart already, but even skimming the text provides quite a few
eyebrow-raising moments. For instance, the revelation that merely reading a
book does not infringe copyright might be worth repeating to, say, the music
industry, but what copyright is all about is indicated by its name. And
traditionally, copying information does not tend to include "copying" it into
one's brain via the visual system and other cognitive processes.
However, it is easy to see the top of the slippery slope at this point. What
if the software is "an artificial intelligence" (sigh) that is merely being
trained? It is then easy to imagine that companies might want to have things
both ways (as they do now, but that is another matter): their artificial
minion does all the work and isn't infringing anyone's copyright, but the
company gets to copyright the output. At the same time, they can plead that it
is just a machine and, unlike a human, cannot knowingly plagiarise other
people's works.
Another thing that stood out was this: "The output of a machine simply does
not qualify for copyright protection – it is in the public domain." Although I
recognise that within a specific context, it might be true, it certainly is
not unquestionably true beyond that context. A compiler takes source code and
produces object code, but that object code is not in the public domain. Having
spent the last few months indulging my nostalgia and reading old computing
publications, I am reminded of the outrage back in the 1980s when one company
produced a compiler for a microcomputer and then claimed that the output was,
at least in part, based on their original work:
"Softek compiler payments dispute"
https://archive.org/details/popular-computing-weekly-1983-05-26/mode/1up
Even if "computed" output is not subject to copyright, various inputs will be,
and as the output is a translation of some input, it may be, too. In the
Free Software movement, people are fairly careful about such precedents for
good reasons. I say good luck to anyone wanting to test their legal theories
by, let us say, publishing a machine-translated version of one of the Harry
Potter books.
> Finally for more amusement value: A different conversation
> points out to me that I should have made my mail more
> click-baity: "Does copilot mean that microsoft has lost
> its license to distribute the linux kernel ?" I am still
> not sure - but maybe this bit of sensationalism makes it
> clearer what is at stake ?
One might argue that if the tool reproduces code fragments that are big enough
to be considered candidates for copyright infringement, then, since they are
source code fragments, the only obligation is to ensure that copyright and
licensing information is also provided to the user of the tool. That possibly
gets GitHub off the hook, but it then leaves the user of the tool to figure
out what the status of the resulting work might be. Maybe it should also be
generating a REUSE manifest to help that poor end-user.
Paul