GPLed code on github (given the copilot controversy)

Daniele "Mte90" Scasciafratte mte90net at gmail.com
Mon Jul 12 09:51:07 UTC 2021


I ran into something similar to the Copilot situation with the Mozilla Italia DeepSpeech Italian model (https://github.com/MozillaItalia/DeepSpeech-Italian-Model).

When I was studying how to handle the various audio+text and text-only Italian datasets, I talked with other people in the machine learning community and also with some lawyers.
It is a grey area: the key point is that you are generating a model that does not let you recreate the original materials (in our case) and that, in some cases, does not store the actual training content, only numbers.

So for the audio+text datasets we chose to use only datasets with CC or public domain licenses (there are very few, since many are academic and carry no license at all), while for text-only we used various sources, including some with no license.
That is because we aggregate all the sources together and remove duplicates, symbols and other material, so it is not possible to recreate the original material.
The generated model is released as CC0 and lists all the datasets used; the same goes for the text-only corpus. We also release all the scripts that generate it, but not the intermediate files created during parsing, only the final output.
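
As a rough illustration of the aggregation step described above, here is a minimal Python sketch. It is not the project's actual script: the function names (normalize, build_corpus), the exact cleaning rules and the final sort are assumptions made only for this example.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Lowercase a sentence and strip symbols, digits and extra spaces.

    Hypothetical cleaning rules, chosen only for illustration.
    """
    text = unicodedata.normalize("NFC", sentence).lower()
    # Replace anything that is not a letter, whitespace or apostrophe
    # (accented letters are kept) with a space.
    text = re.sub(r"[^\w\s']|[\d_]", " ", text)
    return " ".join(text.split())


def build_corpus(sources):
    """Merge several lists of sentences into one deduplicated corpus.

    The result is sorted, so the order and grouping of the original
    sources is not preserved in the output.
    """
    seen = set()
    for sentences in sources:
        for sentence in sentences:
            cleaned = normalize(sentence)
            if cleaned:
                seen.add(cleaned)
    return sorted(seen)


if __name__ == "__main__":
    corpus = build_corpus([
        ["Ciao, mondo!", "Perché no?"],
        ["ciao mondo", "L'albero #1"],
    ])
    print("\n".join(corpus))
```

Only the final merged list would be published in a setup like this; the per-source intermediate files never leave the build machine.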

This is similar to what https://github.com/common-voice/cv-sentence-extractor does: it uses Wikipedia and Wikisource as sources, but picks just 3 random sentences from each article. I know the Mozilla Legal team was involved in that project, and even though Wikipedia is freely licensed (CC BY-SA), they preferred that approach.
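
The per-article limit can be sketched in a few lines of Python. This is only an illustration of the idea, not the extractor's real code: the function name pick_sentences, its parameters and the assumption that each article is already split into sentences are all hypothetical.

```python
import random


def pick_sentences(article_sentences, max_per_article=3, rng=None):
    """Return at most max_per_article sentences chosen at random.

    article_sentences is assumed to be a list of sentences that were
    already split from a single article. Passing a seeded random.Random
    as rng makes the choice reproducible.
    """
    if rng is None:
        rng = random.Random()
    count = min(max_per_article, len(article_sentences))
    # random.sample draws without replacement, so no sentence repeats.
    return rng.sample(article_sentences, count)


if __name__ == "__main__":
    article = ["Prima frase.", "Seconda frase.", "Terza frase.",
               "Quarta frase.", "Quinta frase."]
    for sentence in pick_sentences(article, rng=random.Random(42)):
        print(sentence)
```

With a cap like this, no single article contributes more than a small fragment to the resulting dataset, whatever its length.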

The issue I see with Copilot, and also with Kite or TabNine, which are all autocomplete or code-generation services trained on open source code, is that they can recreate parts of the code, though not all of it, and the law does not say how much is allowed.
As an example, in Italy copyright law allows photocopying only 20% of a book.

PS: the story of the Italian project is told in a talk at FOSDEM 2020 https://archive.fosdem.org/2020/schedule/event/how_to_get_fun_with_teamwork/ or in the written version https://daniele.tech/2019/12/how-the-italian-deepspeech-model-helped-our-mozilla-italia-community/


Daniele Scasciafratte - OpenSource MultiVersal Guy
daniele.tech <https://daniele.tech> - @Mte90Net <https://twitter.com/Mte90net> - GitHub <https://github.com/Mte90> - Italian Linux Society council member <http://www.ils.org/> - Mozillian <https://people.mozilla.org/p/Mte90>
Mozilla Reps, Mozilla TechSpeakers, WordPress Core Contributor <https://profiles.wordpress.org/mte90>, FSFE member <https://fsfe.org/>,
LibreItalia member <http://www.libreitalia.it/soci/>, Wikimedia Italia member <https://www.wikimedia.it/> and LUG Rieti founder <http://lugrieti.linux.it/>.
On 12/07/21 11:12, Paul Schaub wrote:
> Hey,
>
> while I can't answer your questions, here is an article by Julia Reda,
> arguing that Copilot is not in fact infringing copyright and that the
> copyleft movement would not benefit from stricter copyright rules:
>
> https://juliareda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/
>
> Regarding your question about music, there is an interesting provocative
> project:
>
> Fairuseify.ml uses a neural network to "learn" music that you upload to
> it. You can then download what the network "learned" (which in my
> experiments pretty much sounds like what you uploaded).
>
> https://fairuseify.ml/
>
> I'll watch this debate closely.
>
> Paul
>
> On 10.07.21 at 10:58, marc wrote:
>> Hi
>>
>> The way I understand this (I welcome corrections), is that
>> copilot is a piece of proprietary software which was built
>> using a corpus of software hosted on github.
>>
>> And if I understand it properly, this corpus of software
>> includes code nominally released under the GPL and similar
>> licenses. Is my understanding correct so far ?
>>
>> I am told that copilot reproduces verbatim identifiable
>> chunks of the code that was loaded into its model. This
>> presents two interesting questions - is the model (which
>> is an algorithmic transformation of copyrighted material)
>> a new, rather than derivative work ? Are the fragments of
>> code reproduced by it sufficiently small to be considered
>> fair use ?
>>
>> If I had to guess, I would answer no to both of these
>> questions: The pasting in is verbatim and automated,
>> and while individual pieces might not be large, there are
>> many. Fundamentally, I also would like to think that being
>> creative is something only a human/real intelligence
>> can be, otherwise the guy who wrote a program to enumerate
>> all melodies shorter than N would be owed sampling fees on
>> all new music... and if a few notes of music require a royalty
>> then a snippet of code might too ? If somebody writes an
>> "autocompose" equivalent to copilot, trained on existing
>> music, then they could expect legal action from record
>> labels ? And not only for the output of the "autoimprovise"
>> but also the model that forms the core of "autoplay" ?
>>
>> But I am not qualified to provide a final answer on
>> that - I suppose judges will be the ones to make that
>> determination. Note the plural there - I am told fair
>> use and its equivalents differ nontrivially between
>> countries...
>>
>> But assuming a no to those two points, github would then
>> (I think) rely on the terms and conditions on its site
>> which state that by uploading code to its site, you give
>> it the license to re-use your work, even in closed-source
>> products, apparently ?
>>
>> Now: If you upload somebody else's GPLed code to github
>> that then implies that you (and maybe github) have
>> contravened the license, right ? And possibly that means
>> you lose your own permission to hold a copy of the GPLed
>> code ? Alternatively, if you are the legitimate owner of
>> the code, you have just dual licensed your code (with a
>> commercial grant to github, owned by microsoft) which is
>> probably not what you intended ?
>>
>> Note the question marks everywhere - have I gotten my
>> facts wrong ? Is my reasoning incorrect ? Would it be
>> prudent to avoid using github for (A/L/)GPLed code until
>> the legalities have been settled ?
>>
>> regards
>>
>> marc
>> _______________________________________________
>> Discussion mailing list
>> Discussion at lists.fsfe.org
>> https://lists.fsfe.org/mailman/listinfo/discussion
>>
>> This mailing list is covered by the FSFE's Code of Conduct. All
>> participants are kindly asked to be excellent to each other:
>> https://fsfe.org/about/codeofconduct