<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>I ran into something similar to the Copilot question while working on
      the Mozilla Italia DeepSpeech Italian model
      (<a class="moz-txt-link-freetext" href="https://github.com/MozillaItalia/DeepSpeech-Italian-Model">https://github.com/MozillaItalia/DeepSpeech-Italian-Model</a>).</p>

    <p>When I studied how to deal with the various audio+text and text-only
      Italian datasets, I talked a bit with other people in the machine
      learning community and also with some lawyers.<br>
      It is a grey area, because the key point is that you are generating a
      model that doesn't let you recreate the original materials (in our
      case) and that, in some cases, doesn't store the actual training
      content but just numbers. <br>

    </p>

    <p>So for the audio+text data we chose to use only datasets with CC or
      public domain licenses (very few, as many of them are academic and
      carry no license at all), whereas for the text-only data we used
      various sources, including unlicensed ones.<br>
      We did this because we aggregate all the sources together and remove
      duplicates, symbols and other noise, so it is not possible to
      recreate the original material.<br>
      The generated model is released as CC0 and lists all the datasets
      used, and the same goes for the text-only corpus. We also release all
      the scripts used to generate it, but not the intermediate files
      created during parsing, only the final output.<br>

    </p>
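    <p>To give an idea of the aggregation step described above, here is a
      minimal Python sketch. It is not the actual script from the
      repository: the function names (<code>normalize_sentence</code>,
      <code>build_corpus</code>), the cleaning rules and the sample
      sentences are assumptions made only for illustration.</p>
<pre>
```python
import re
import unicodedata


def normalize_sentence(text):
    """Lowercase the text, drop symbols and digits, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep only letters (accented ones included), apostrophes and spaces;
    # everything else is replaced by a space.
    text = "".join(ch if ch.isalpha() or ch in "' " else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(sources):
    """Merge several iterables of raw lines into one de-duplicated list."""
    unique = set()
    for lines in sources:
        for line in lines:
            cleaned = normalize_sentence(line)
            if cleaned:
                unique.add(cleaned)
    # Sorting means the sentence order of each source is not preserved.
    return sorted(unique)


if __name__ == "__main__":
    # Hypothetical sample data standing in for two text-only sources.
    source_a = ["Ciao, mondo!", "Il gatto è sul tavolo."]
    source_b = ["ciao mondo", "L'acqua è fredda (forse)."]
    for sentence in build_corpus([source_a, source_b]):
        print(sentence)
```
</pre>
    <p>In this sketch the two variants of "ciao mondo" collapse into a
      single entry, and the output no longer records which source a
      sentence came from.</p>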

    <p>This is similar to what
      <a class="moz-txt-link-freetext" href="https://github.com/common-voice/cv-sentence-extractor">https://github.com/common-voice/cv-sentence-extractor</a> does:
      it uses Wikipedia and Wikisource as sources but picks just 3 random
      sentences from every article. I know the Mozilla Legal team was
      involved in that project and, since Wikipedia is licensed under CC
      BY-SA while Common Voice is released as CC0, they preferred that
      approach.<br>

    </p>
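    <p>The "a few random sentences per article" idea can be sketched in
      Python as below. This is only an illustration and not how
      cv-sentence-extractor is implemented: the naive sentence splitter,
      the function names and the <code>max_per_article</code> parameter are
      all assumptions.</p>
<pre>
```python
import random
import re


def split_sentences(article_text):
    """Naive splitter: a sentence runs up to the next '.', '!' or '?'."""
    parts = re.findall(r"[^.!?]+[.!?]*", article_text)
    return [p.strip() for p in parts if p.strip()]


def sample_sentences(article_text, max_per_article=3, rng=None):
    """Pick at most max_per_article distinct sentences at random."""
    if rng is None:
        rng = random.Random()
    sentences = split_sentences(article_text)
    count = min(max_per_article, len(sentences))
    return rng.sample(sentences, count)


if __name__ == "__main__":
    # Hypothetical article text used only for the demonstration.
    article = "Uno. Due! Tre? Quattro. Cinque."
    for sentence in sample_sentences(article):
        print(sentence)
```
</pre>
    <p>Capping the number of sentences per article keeps the extracted
      corpus to a small fraction of each source text.</p>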

    <p>The issue I see with Copilot, and also with Kite or TabNine, which
      are all code autocompletion or code generation services trained on
      open source code, is that they can reproduce parts of the original
      code, though not all of it, and the law doesn't specify how much
      reproduction is allowed.<br>

      For comparison, Italian copyright law allows photocopying only 15%
      of a book.<br>

    </p>

    <p>PS: the story of the Italian project is told in a talk at FOSDEM 2020
<a class="moz-txt-link-freetext" href="https://archive.fosdem.org/2020/schedule/event/how_to_get_fun_with_teamwork/">https://archive.fosdem.org/2020/schedule/event/how_to_get_fun_with_teamwork/</a>
      or in the written version
<a class="moz-txt-link-freetext" href="https://daniele.tech/2019/12/how-the-italian-deepspeech-model-helped-our-mozilla-italia-community/">https://daniele.tech/2019/12/how-the-italian-deepspeech-model-helped-our-mozilla-italia-community/</a><br>

    </p>

    <div class="moz-signature"><br>

      <small>Daniele Scasciafratte - OpenSource MultiVersal Guy<br>

        <a href="https://daniele.tech">daniele.tech</a> - <a

          href="https://twitter.com/Mte90net">@Mte90Net</a> - <a

          href="https://github.com/Mte90">GitHub</a> - <a

          href="http://www.ils.org/">Italian Linux Society council

          member</a> - <a href="https://people.mozilla.org/p/Mte90">Mozillian</a><br>

        Mozilla Reps, Mozilla TechSpeakers, <a

          href="https://profiles.wordpress.org/mte90">WordPress Core

          Contributor</a>, <a href="https://fsfe.org/">FSFE member</a>,

        <br>

        <a href="http://www.libreitalia.it/soci/">LibreItalia member</a>,

        <a href="https://www.wikimedia.it/">Wikimedia Italia member</a>

        and <a href="http://lugrieti.linux.it/">LUG Rieti founder</a>.</small><br>

    </div>

    <div class="moz-cite-prefix">On 12/07/21 11:12, Paul Schaub wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:c1067417-92b8-6925-6855-d12c709a822c@fsfe.org">

      <pre class="moz-quote-pre" wrap="">Hey,

while I can't answer your questions, here is an article by Julia Reda,

arguing that Copilot is not in fact infringing copyright and that the

copyleft movement would not benefit from stricter copyright rules:

<a class="moz-txt-link-freetext" href="https://juliareda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/">https://juliareda.eu/2021/07/github-copilot-is-not-infringing-your-copyright/</a>

Regarding your question about music, there is an interesting provocative

project:

Fairuseify.ml uses a neural network to "learn" music that you upload to

it. You can then download what the network "learned" (which in my

experiments pretty much sounds like what you uploaded).

<a class="moz-txt-link-freetext" href="https://fairuseify.ml/">https://fairuseify.ml/</a>

I'll watch this debate closely.

Paul

On 10.07.21 at 10:58, marc wrote:

</pre>

      <blockquote type="cite">

        <pre class="moz-quote-pre" wrap="">Hi

The way I understand this (I welcome corrections), is that

copilot is a piece of proprietary software which was built

using a corpus of software hosted on github.

And if I understand it properly, this corpus of software

includes code nominally released under the GPL and similar

licenses. Is my understanding correct so far ?

I am told that copilot reproduces verbatim identifiable

chunks of the code that was loaded into its model. This

presents two interesting questions - is the model (which

is an algorithmic transformation of copyrighted material)

a new, rather than derivative work ? Are the fragments of

code reproduced by it sufficiently small to be considered

fair use ?

If I had to guess, I would answer no to both of these

questions: The pasting in is verbatim and automated,

and while individual pieces might not be large, there are

many. Fundamentally, I also would like to think that being

creative is something only a human/real intelligence

can be, otherwise the guy who wrote a program to enumerate

all melodies shorter than N would be owed sampling fees on

all new music... and if a few notes of music require a royalty

then a snippet of code might too ? If somebody writes an

"autocompose" equivalent to copilot, trained on existing

music, then they could expect legal action from record

labels ? And not only for the output of the "autoimprovise"

but also the model that forms the core of "autoplay" ?

But I am not qualified to provide a final answer on

that - I suppose judges will be the ones to make that

determination. Note the plural there - I am told fair

use and its equivalents differ nontrivially between

countries...

But assuming a no to those two points, github would then

(I think) rely on the terms and conditions on its site

which state that by uploading code to its site, you give

it the license to re-use your work, even in closed-source

products, apparently ?

Now: If you upload somebody else's GPLed code to github

that then implies that you (and maybe github) have

contravened the license, right ? And possibly that means

you lose your own permission to hold a copy of the GPLed

code ? Alternatively, if you are the legitimate owner of

the code, you have just dual licensed your code (with a

commercial grant to github, owned by microsoft) which is

probably not what you intended ?

Note the question marks everywhere - have I gotten my

facts wrong ? Is my reasoning incorrect ? Would it be

prudent to avoid using github for (A/L/)GPLed code until

the legalities have been settled ?

regards

marc

_______________________________________________

Discussion mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Discussion@lists.fsfe.org">Discussion@lists.fsfe.org</a>

<a class="moz-txt-link-freetext" href="https://lists.fsfe.org/mailman/listinfo/discussion">https://lists.fsfe.org/mailman/listinfo/discussion</a>

This mailing list is covered by the FSFE's Code of Conduct. All

participants are kindly asked to be excellent to each other:

<a class="moz-txt-link-freetext" href="https://fsfe.org/about/codeofconduct">https://fsfe.org/about/codeofconduct</a>

</pre>

      </blockquote>


    </blockquote>

  </body>

</html>