
= Artificial intelligence and copyright =

In the 2020s, the rapid advancement of deep-learning-based generative artificial intelligence models has raised questions about whether copyright infringement occurs when such models are trained or used. These models include text-to-image generators such as Stable Diffusion and large language models such as ChatGPT. As of 2023, several US lawsuits challenging the use of copyrighted data to train AI models are pending, with defendants arguing that this use falls under fair use.

Popular deep-learning models are trained on massive amounts of media scraped from the Internet, often including copyrighted material. When assembling training data, sourcing copyrighted works may infringe the copyright holder's exclusive right to control reproduction, unless covered by exceptions in the relevant copyright laws. Additionally, using a model's outputs might violate copyright, and the model's creator could be accused of vicarious liability and held responsible for that infringement.

== Training AI with copyrighted data ==
Deep learning models source large datasets from the Internet, such as publicly available images and the text of web pages. The text and images are converted into numeric formats the AI can analyze. A deep learning model identifies patterns linking the encoded text and image data and learns which textual concepts correspond to which elements in images. Through repeated testing, the model refines its accuracy by matching images to text descriptions. The trained model is then validated on its ability to generate or manipulate new images from text prompts alone. Because assembling these training datasets involves making copies of copyrighted works, the process raises the question of whether it infringes the copyright holder's exclusive right to reproduce their works.
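The pattern-matching step described above, in which a model learns which text concepts correspond to which images, can be sketched as a toy similarity check. Real systems learn deep neural encoders (as in CLIP-style contrastive training); in this sketch the "encoded" features, dimensions, and data are illustrative stand-ins, not any actual model's internals:

```python
import numpy as np

# Toy stand-ins for learned encoders: in a real system, deep networks map
# images and captions into a shared feature space. Here each "image" and
# "caption" is already a small feature vector; captions are noisy copies
# of their matching images to simulate learned alignment.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))                       # 4 images, 8-dim features
text_feats = image_feats + 0.05 * rng.normal(size=(4, 8))   # matching captions

def normalize(x):
    # Unit-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

img = normalize(image_feats)
txt = normalize(text_feats)

# Similarity matrix: entry (i, j) scores image i against caption j.
sims = img @ txt.T

# A well-aligned model scores each image highest against its own caption,
# which is what repeated training on image-text pairs optimizes for.
predicted = sims.argmax(axis=1)
```

The same cosine-similarity scoring is what lets a trained model rank candidate outputs against a text prompt at generation time.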

US machine learning developers have traditionally believed this to be allowable under fair use because the use of the copyrighted work is transformative and limited in scope. The situation has been compared to Google Books' scanning of copyrighted books in Authors Guild, Inc. v. Google, Inc., which was ultimately found to be fair use because the scanned content was not made publicly available and the use was non-expressive.

Timothy B. Lee, in Ars Technica, argues that if the plaintiffs succeed, this may shift the balance of power in favour of large corporations such as Google, Microsoft, and Meta, which can afford to license large amounts of training data from copyright holders and leverage their proprietary datasets of user-generated data. IP scholars Bryan Casey and Mark Lemley argue in the Texas Law Review that datasets are so large that "there is no plausible option simply to license all of the (data)... allowing (any generative training) copyright claim is tantamount to saying, not that copyright owners will get paid, but that the use won't be permitted at all." Other scholars disagree; some predict an outcome similar to that of US music licensing procedures.

Several jurisdictions, including the United Kingdom, Germany, Japan, and the EU, have explicitly incorporated exceptions allowing "text and data mining" (TDM) in their copyright statutes. Unlike the EU, the United Kingdom prohibits data mining for commercial purposes, but it has proposed changing this to support the development of AI: "For text and data mining, we plan to introduce a new copyright and database exception which allows TDM for any purpose. Rights holders will still have safeguards to protect their content, including a requirement for lawful access." As of June 2023, a clause in the draft EU AI Act would require generative AI providers to "make available summaries of the copyrighted material that was used to train their systems".

== Copyright status ==
Since most legal jurisdictions only grant copyright to original works of authorship by human authors, the definition of "originality" is central to the copyright status of AI-generated works.

=== United States ===
The Copyright Act of 1976 protects "original works of authorship". The U.S. Copyright Office has interpreted this as being limited to works "created by a human being", declining to grant copyright to works generated solely by a machine.

Some have suggested that certain AI generations might be eligible for copyright in the US and similar jurisdictions if it can be shown that the human who ran the AI program exercised sufficient originality in selecting the inputs to the AI or editing the AI's output. Proponents of this view suggest that an AI model may be viewed as merely a tool (akin to a pen or a camera) used by its human operator to express their creative vision. For example, proponents argue that if the standard of originality can be satisfied by an artist clicking the shutter button on a camera, then artists using generative AI should perhaps get similar deference, especially if they go through multiple rounds of revision to refine their prompts. Other proponents argue that the Copyright Office is not taking a technology-neutral approach to the use of AI (or algorithmic) tools. For other creative expressions (music, photography, writing), the test is effectively whether there is de minimis human creativity. For works using AI tools, the Copyright Office applies a different test: whether there is no more than de minimis technological involvement.

This difference in approach can be seen in the recent decision on a registration claim by Jason Matthew Allen for his work Théâtre D'opéra Spatial, created using Midjourney (and an upscaling tool), where the Copyright Office stated: "The Board finds that the Work contains more than a de minimis amount of content generated by artificial intelligence ('AI'), and this content must therefore be disclaimed in an application for registration. Because Mr. Allen is unwilling to disclaim the AI-generated material, the Work cannot be registered as submitted."

As AI is increasingly used to generate literature, music, and other forms of art, the US Copyright Office has released new guidance emphasizing whether works, including materials generated by artificial intelligence, exhibit a "mechanical reproduction" nature or are the "manifestation of the author's own creative conception". The US Copyright Office published a rule in March 2023 on a range of issues related to the use of AI, in which it stated: ...because the Office receives roughly half a million applications for registration each year, it sees new trends in registration activity that may require modifying or expanding the information required to be disclosed on an application.

One such recent development is the use of sophisticated artificial intelligence ("AI") technologies capable of producing expressive material. These technologies "train" on vast quantities of preexisting human-authored works and use inferences from that training to generate new content. Some systems operate in response to a user's textual instruction, called a "prompt."

The resulting output may be textual, visual, or audio, and is determined by the AI based on its design and the material it has been trained on. These technologies, often described as "generative AI," raise questions about whether the material they produce is protected by copyright, whether works consisting of both human-authored and AI-generated material may be registered, and what information should be provided to the Office by applicants seeking to register them.

=== United Kingdom ===
Some jurisdictions include explicit statutory language on computer-generated works, including the United Kingdom's Copyright, Designs and Patents Act 1988, which states: "In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken."

However, the computer-generated work provision under UK law relates to autonomous creations by computer programs. Individuals using AI tools will usually be the authors of the works, assuming they meet the minimum requirements for a copyright work. In respect of AI, the computer-generated work language concerns the ability of human programmers to hold copyright in the autonomous productions of AI tools (i.e. where there is no direct human input): "In so far as each composite frame is a computer generated work then the arrangements necessary for the creation of the work were undertaken by Mr Jones because he devised the appearance of the various elements of the game and the rules and logic by which each frame is generated and he wrote the relevant computer program. In these circumstances I am satisfied that Mr Jones is the person by whom the arrangements necessary for the creation of the works were undertaken and therefore is deemed to be the author by virtue of s.9(3)"

There are wide-ranging reviews of the use of AI and its impact on copyright. The UK government has consulted on the use of generative tools and AI in respect of intellectual property, leading to a proposed specialist Code of Practice: "to provide guidance to support AI firms to access copyrighted work as an input to their models, whilst ensuring there are protections on generated output to support right holders of copyrighted work". The US Copyright Office recently published a Notice of Inquiry and request for comments following its 2023 Registration Guidance.

=== China ===
On November 27, 2023, the Beijing Internet Court issued a decision recognizing copyright in an AI-generated image in an infringement dispute.

As noted by a lawyer and AI art creator, the challenge for intellectual property regulators, legislators, and the courts is how to protect human creativity in a technologically neutral fashion whilst considering the risks of automated AI factories. AI tools can autonomously create a range of material that is potentially subject to copyright (music, blogs, poetry, images, and technical papers) or other intellectual property rights (such as patents and design rights). This represents an unprecedented challenge to existing intellectual property regimes.

== AI output copyright violations ==
[Side-by-side figure: a photograph of Anne Graham Lotz included in Stable Diffusion's training set, and an image generated by Stable Diffusion using the prompt "Anne Graham Lotz". In rare cases, generative AI models may produce outputs that are virtually identical to images from their training set; the research paper from which this example was taken was able to produce similar replications for only 0.03% of training images.]

In some cases, deep learning models may "memorize" the details of particular items in their training set and reproduce them at generation time, such that their outputs may constitute copyright infringement. This behaviour is generally considered undesirable by AI developers (being a form of overfitting), and disagreement exists as to how prevalent it is in modern systems. OpenAI has argued that "well-constructed AI systems generally do not regenerate, in any nontrivial portion, unaltered data from any particular work in their training corpus". Under US law, to prove that an AI output infringes a copyright, a plaintiff must show the copyrighted work was "actually copied", which requires both that the AI had access to their work and that the AI's output is "substantially similar" to that work.
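As an illustration of how memorized outputs can be detected, researchers have compared generated images against the training set and flagged near-verbatim matches. The sketch below is a minimal, hypothetical version of that idea: the arrays stand in for image data, and the function name and distance threshold are illustrative, not taken from any cited study:

```python
import numpy as np

# Toy near-duplicate check: compare one generated image against a training
# set and flag any training image within a small pixel-distance threshold.
rng = np.random.default_rng(1)
train = rng.random((100, 32, 32))                      # 100 fake 32x32 grayscale images
generated = train[42] + 0.01 * rng.random((32, 32))    # near-copy of training image 42

def nearest_training_match(gen, train_set):
    # Mean squared pixel error against every training image (broadcast
    # over the leading axis), returning the closest index and its error.
    errors = ((train_set - gen) ** 2).mean(axis=(1, 2))
    return errors.argmin(), errors.min()

idx, err = nearest_training_match(generated, train)
is_memorized = err < 1e-3   # illustrative threshold for "virtually identical"
```

Real studies use more robust perceptual or embedding-space distances rather than raw pixel error, but the structure of the check (nearest training neighbour plus a similarity threshold) is the same.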

Since fictional characters enjoy some copyright protection in the US and other jurisdictions, an AI may also produce infringing content in the form of novel works which incorporate fictional characters.

In the course of learning to statistically model the data on which they are trained, deep generative AI models may learn to imitate the distinct style of particular authors in the training set. For example, a generative image model such as Stable Diffusion is able to model the stylistic characteristics of an artist like Pablo Picasso (including his particular brush strokes, use of colour, perspective, and so on), and a user can engineer a prompt such as "an astronaut riding a horse, by Picasso" to cause the model to generate a novel image applying the artist's style to an arbitrary subject. However, an artist's overall style is generally not subject to copyright protection.

== Existing litigation ==

 * A November 2022 class action lawsuit against Microsoft, GitHub and OpenAI alleged that GitHub Copilot, an AI-powered code editing tool trained on public GitHub repositories, violated the copyright of the repositories' authors, noting that the tool was able to generate source code that matched its training data verbatim, without providing attribution.
 * In January 2023, three artists — Sarah Andersen, Kelly McKernan, and Karla Ortiz — filed a class action copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, claiming that these companies infringed the rights of millions of artists by training AI tools on five billion images scraped from the web without the consent of the original artists. The plaintiffs' complaint has been criticized for technical inaccuracies, such as incorrectly claiming that "a trained diffusion model can produce a copy of any of its Training Images" and describing Stable Diffusion as "merely a complex collage tool". In addition to copyright infringement, the plaintiffs allege unlawful competition and violation of their right of publicity concerning AI tools' ability to create works in the style of the plaintiffs en masse. In July 2023, U.S. District Judge William Orrick indicated he was inclined to dismiss most of the lawsuit filed by Andersen, McKernan, and Ortiz but allowed them to file an amended complaint.
 * In January 2023, Stability AI was sued in London by Getty Images for using its images in their training data without purchasing a license.
 * Getty filed another suit against Stability AI in a US district court in Delaware in February 2023. The suit again alleges copyright infringement for the use of Getty's images in the training of Stable Diffusion and further argues that the model infringes Getty's trademark by generating images with Getty's watermark.
 * In July 2023, authors Paul Tremblay and Mona Awad filed a lawsuit against OpenAI, alleging that ChatGPT, OpenAI's language model, was trained on their copyrighted books without permission; they argue that ChatGPT's ability to produce accurate summaries of their works suggests unauthorized use of their content. The lawsuit highlights the conflict between copyright owners and AI companies and may prompt clarification of copyright rules and data-source disclosure requirements.
 * In August 2023, in the case of Thaler v. Perlmutter, the U.S. District Court for the District of Columbia upheld the Copyright Office's refusal to register a work generated autonomously by Stephen Thaler's AI system, holding that copyright requires human authorship. Thaler had previously litigated the related question of AI inventorship in Thaler v. Vidal, in which the U.S. Court of Appeals for the Federal Circuit upheld the U.S. Patent and Trademark Office's decision that, under U.S. law, only "natural persons" can be named as inventors on patent applications; that case concerned two inventions created with Thaler's AI program, DABUS. While the USPTO has not challenged granting patent protection to AI-assisted inventions where the named inventor is a natural person, copyright protection is not granted to works produced by a machine without any creative input from a human author. The USPTO published guidance in February 2024 affirming this position while allowing human inventors to incorporate the output of artificial intelligence, as long as the human contribution is appropriately documented in the patent application. However, such documentation may prove virtually impossible when the inner workings of the AI and its role in the inventive process are not adequately understood or are largely unknown.