Copyright Protection in LLM AI Training Part 2

Jan 27, 2025

Note: In the first part, the author discussed how GenAI uses publishers’ proprietary data to train its AI models and elaborated on the arguments raised in the American jurisdiction. This second part delves into the Indian landscape in light of the ANI v. OpenAI litigation.

The intersection of artificial intelligence and copyright law has emerged as a critical frontier in legal and technological discourse, exemplified by the ANI v. OpenAI case. This landmark case, the first of its kind in India, addresses pressing concerns about AI models infringing copyright through training datasets. It delves into foundational questions of unauthorized reproduction, data storage, and the applicability of doctrines like transformative and extractive use under Indian law. As global debates intensify, the case brings into focus the challenges posed by generative AI to intellectual property regimes while highlighting the need for robust legal frameworks to balance innovation with the protection of creators’ rights.

Indian Jurisdiction, Fair Use Doctrine and AI

ANI v. OpenAI - The Case Till Now

This is the first case in India concerning AI models and copyright infringement through training datasets. It is similar to the American actions and involves three causes of action:

ANI argues that OpenAI infringed its copyright by storing, using and making copies of its copyrighted materials for training purposes without permission. The fact that the content is publicly accessible does not negate the need for OpenAI to obtain permission to use the material.
ANI asserts that ChatGPT generates responses that are either verbatim or substantially similar to its own content. This raises concerns about potential violations of copyright law and unauthorised use of ANI’s intellectual property.
ANI points to instances where ChatGPT has produced responses that attribute interviews or reports to individuals or organisations in a misleading or false manner.

In line with its arguments in the Times case, OpenAI’s defence is that ANI’s material is publicly available and can be blocked via the “Robots.txt” protocol if ANI does not want it accessed. (The Robots.txt protocol blocks OpenAI’s web crawlers; it is dealt with in detail in a subsequent section.) OpenAI asserts that its practices are transparent and that no court has ruled against it for copyright infringement in any of several countries, including the US, Canada, and Germany. It claims that its model generates responses based on various data sources and does not reproduce ANI’s material verbatim.

Additionally, OpenAI denies accessing subscription-based content and argues that ANI has not provided specific evidence of infringement. The company maintains that it respected ANI's blocklist and continues to produce material based on publicly available data. However, as seen in the previous sections of this blog, ChatGPT has provided identical excerpts from paywalled articles. Although ANI has not yet proved this, OpenAI’s statement regarding paywalled articles is weak, especially considering the outputs of its products that have been placed on record in the U.S.

The court passed an interim order issuing notice and summons to OpenAI. The court acknowledged ANI's placement of OpenAI's crawlers on a blocklist. The hearing is scheduled for 28 January 2025, giving both parties time to prepare their arguments. These arguments will primarily deal with the reproduction, storage and extraction of the copyrighted content. The following section examines how these different arguments may be approached under the Indian copyright regime.

Unauthorized Storage by AI: The Two Views on Storage

Training generative AI (GenAI) models involves copying and storing data, including copyrighted works, to extract information. This process can occur in three ways:

Permanent Storage: Data is stored throughout the model's lifecycle.
Temporary Storage: Data is stored only until it is absorbed by the model.
No Storage: Federated or collaborative learning uses decentralized servers, avoiding centralized data storage.

The Indian Copyright Act gives copyright owners the exclusive right to reproduce works, including storing them electronically. However, the law does not explicitly define "reproduction" or "copying." Courts interpret reproduction as storing or duplicating the expressive form of a work, not merely ideas or meta-information extracted from it.

This distinction is critical in GenAI training. The process does not aim to enjoy or expose the work as humans would but instead tokenizes the data to train models. Hence it is a non-expressive use: only metadata is extracted to train the AI. Such use may not substitute or exploit the work's primary market, a key concern of copyright law.

Courts must decide whether to interpret reproduction narrowly (protecting expressive use) or broadly (covering all storage and copying). Current legal principles, such as the idea-expression dichotomy and limiting doctrines like de minimis use, suggest that GenAI training might not violate copyright if it doesn't expose or exploit the expressive form of the work. However, a broad interpretation would mean that AI training violates the “storage right” that forms part of the owner's exclusive rights.

Transformative Use (The Non-Application of the American Argument to India)

The NYT case adopted the doctrine of transformative use. The Indian regime does not recognize a transformative use exception to copyright infringement within the parameters of Section 52 of the Copyright Act. However, the Division Bench of the Delhi High Court in University of Cambridge v. BD Bhandari [2011 SCC OnLine Del 321] held that the use of a work for the purpose of making a guidebook is a substantially different purpose from that for which the Plaintiff's original work was made.

Although the Court recognised this purpose to be a transformative one, which did not impinge upon the expressive purpose for which the Plaintiff had an exclusive reproduction right, the key element of the judgment was that the Court restricted the reproduction right, or its scope, to the expressive purpose for which the original work was created. This is far from the idea of transformative use in its American sense.

According to Leval J., transformative use is a threshold inquiry to weed out garden-variety infringements. It is a doctrine meant to ease the work of the courts in cases where the infringement is so plainly non-transformative that further inquiry into the four factors is unnecessary. A substantially different purpose cannot logically be extended to a transformation. This is because a transformation requires a complete revamp of the original content, in every manner through which it is perceived. Hence, a homage to a song by making a parody of it does not become transformative. Therefore, the Bhandari judgment (supra) cannot be said to import the full extent of the transformative use doctrine as it is found in the U.S.

Extractive Use Doctrine

Another argument for fair use is the extractive use doctrine. The Division Bench of the Delhi High Court in Akuate Internet Services Pvt. Ltd. v. Star India Pvt. Ltd. [2013 SCC OnLine Del 3344] held that copyright does not extend to monopolizing information, facts, or knowledge embedded within a protectable work. It emphasized that such protection would impede the dissemination of information, a critical component of Article 19(1)(a) of the Indian Constitution. This principle was also affirmed in Wiley Eastern Ltd. v. Indian Institute of Management [61 (1996) DLT 281].

This principle bears directly on the NYT and ANI cases. The publishers claim copyright in their content, alleging that ChatGPT “mimics their expression” by stealing their content. However, ChatGPT merely provides information. How that information is sourced is a different question from how that information is presented. ChatGPT arguably offers the facts and knowledge embedded within the news released by these publishers. If NYT unveiled a governmental scam, its expression of that scam may be protected; however, information about the people involved, knowledge of the crimes committed and the like cannot be protected under copyright law. These are mere extractive uses of the works. Indian copyright law thus follows the idea-expression dichotomy, recognizing that while the expressive form of a work is protectable, the embedded information is not.

Referring to earlier sections of this blog, it is clear that ChatGPT does not “express”. It merely reads the metadata, which is part of the information embedded in the content. Therefore, it is a weak argument to state that ChatGPT steals the expression of NYT. The analysis of such uses often hinges on whether access to the protected expression serves to extract informational content or to reproduce the expression. Extracting ideas may necessitate accessing and copying the entire protected work, but this does not extend copyright protection to unprotectable elements, as copyright does not grant a “right to control access.” Instead, it only protects against reproducing or adapting the expressive form.

Anti-Circumvention Provisions and AI

Section 65A(1) of the Indian Copyright Act prohibits circumvention of technological protection measures (for example, paywalls), making unauthorized access actionable. However, Section 65A(2) permits circumvention for lawful purposes, including those under Section 52 (fair use), provided the facilitator maintains detailed records of users and purposes. This ensures access for permissible uses, such as transformative or extractive purposes. Therefore, Section 65A(2) exists to stop individuals from circumventing the fair use provisions of the Copyright Act by hiding all their works behind a paywall.

This links back to the assertions made by NYT and ANI that illegal access to their paywalled content by ChatGPT is a violation of copyright law. As highlighted by the doctrine of extractive use, information and knowledge are not protected under copyright law. If the content behind the paywall is information or knowledge, extracting it does not amount to a violation. Therefore, by putting such information behind a paywall, instead of merely the expressive work of their reporters, the publishers are in potential violation of the anti-circumvention rules.

Reproduction Rights

Under Section 14 of the Indian Copyright Act, the Reproduction Right protects against substantial similarity that substitutes the original work’s primary market. As per R.G. Anand v. Deluxe Films (1978), infringement occurs only if the output unmistakably resembles the original work as a whole. Generative AI outputs are unlikely to meet this threshold unless they regurgitate training data, possibly via prompt injections (intentional prompts that force the AI to regurgitate).

The Reproduction Right does not protect basic themes, styles, or generic storylines, which are unprotected ideas. Thus, AI outputs infringe only if they substantially reproduce copyrighted expressions or significant, recognizable fragments.
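As a rough illustration of how a publisher might look for "significant, recognizable fragments" in an output, the sketch below measures the longest run of consecutive words shared by an article and a generated answer. This is only an assumed screening heuristic: the function name `longest_shared_run`, the sample texts and any cut-off applied to the result are hypothetical, and a long shared run is not itself the legal test for substantial similarity.

```python
def longest_shared_run(source: str, output: str) -> int:
    """Return the length, in words, of the longest consecutive word
    sequence that appears in both texts (case-insensitive).

    Hypothetical screening helper: words are split on whitespace only,
    so punctuation attached to a word is treated as part of that word.
    """
    a = source.lower().split()
    b = output.lower().split()
    best = 0
    # prev[j] holds the length of the shared run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    article = "the minister announced a new policy on data protection today"
    verbatim = "reports say the minister announced a new policy yesterday"
    paraphrase = "a fresh data rule was unveiled by the government"
    print(longest_shared_run(article, verbatim))    # 6
    print(longest_shared_run(article, paraphrase))  # 1
```

A long shared run points towards regurgitation of expression, whereas a paraphrase that conveys the same facts shares only isolated words, which mirrors the distinction between protected expression and unprotected information.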

Adaptation Rights

Under Indian copyright law, Generative AI outputs are unlikely to infringe the Adaptation Right unless they closely replicate the original work's core expression in a different format or involve only minor alterations insufficient to transform the work’s character. Indian courts, such as the Calcutta High Court in Barbara Taylor Bradford v. Sahara Media, interpret "alteration" narrowly, ensuring transformative works remain protected and distinct from reproduction. Unlike the broader "Derivative Works" doctrine in the U.S., India aligns with the Berne Convention, focusing on medium or format changes without overreaching into transformative uses. This approach emphasizes competition and innovation, fostering independent creation without overextending copyright protections to hinder AI-generated outputs that enrich cultural domains.

Other Developments in India

Press Information Bureau: Existing IPR Regime well equipped to Protect AI Generated Content, no need to create a separate category of rights U/S 52 (Feb 2024)

The exclusive economic rights of a copyright owner, such as the right of reproduction, translation, and adaptation, granted by the Copyright Act, 1957, require users of Generative AI to obtain permission for commercial use of their works if such use is not covered under the fair dealing exceptions under Section 52 of the Copyright Act. Intellectual property rights, being private rights, are enforced by individual rights holders, with adequate and effective civil measures and criminal remedies prescribed under the Copyright Law against infringement or unauthorized use, including digital circumvention. This information was provided by the Union Minister of State for Commerce and Industry, Shri. Som Parkash, in a written reply in the Rajya Sabha.

Digital News Publishers’ Association – Letter to MeiTY Dated 26.01.2024

The Digital News Publishers Association (DNPA) has written a letter to MeiTY urging it to introduce copyright protections against generative AI models, emphasizing fair compensation for news content used to train AI. Drawing attention to international cases like The New York Times lawsuit against OpenAI and Microsoft for alleged copyright infringement, the DNPA highlighted the need for updated laws as India develops its own large language models, such as Ola’s Krutrim. In response to the letter, Union IT Minister Rajeev Chandrasekhar acknowledged the significance of this issue, advocating for content creators' right to monetize the value of their content and calling for legislative discussions to address the impact of AI on digital news platforms.

MeiTY Advisory – 15.03.2024

The advisory aims to enforce regulations for AI intermediaries. It states that every intermediary and platform should ensure that its use of AI (LLM/generative) does not permit its users to host, display, upload, modify, publish, transmit, store, update or share any unlawful content as mentioned in Rule 3(1)(b) of the IT Rules.

Rule 3(1)(b)(i) of the IT Rules states that an intermediary shall not allow on its “computer resource” (whether by itself or through its users) the hosting, display, storage or sharing of any information that belongs to another person and to which the user has no right. Rule 3(1)(b)(iv), in turn, treats “infringement of trademark, copyright or other proprietary rights” as an unlawful activity under the advisory.

The term “Computer Resource” has been defined u/s 2(k) of the IT Act as a “Computer, computer system, computer network, data, computer data base, or software.” AI is software and hence falls within the ambit of this advisory. The advisory is not a mere guideline; it must be mandatorily adhered to, which makes AI copyright compliance mandatory by operation of law.

MeiTY’s Report on AI Governance Guidelines Development

The Ministry of Electronics and Information Technology recognized this problem in December 2024. Its sub-committee report aimed to identify gaps in AI governance. Among other things, it focused on intellectual property rights (IPR) issues in AI, particularly generative AI, under Indian copyright law, examining two areas:

Training AI on Copyrighted Data: Using copyrighted data for training AI models may infringe on copyright unless the law’s narrow exceptions apply. The report questions how to enforce compliance and who would be liable for infringements when multiple parties are involved.
Liability and Policy Questions: It raises concerns about who should be responsible for AI-generated infringements and whether AI can train on bulk copyrighted data without explicit permission, suggesting the need for clearer policies and safeguards.

How Can the Publishers Safeguard Their Interests?

The question of whether OpenAI's use of publishers' data violates their copyright is still before the courts in all major global jurisdictions. However, publishers and content creators have multiple remedies they can adopt to safeguard their data from being crawled or used by such GenAI models.

Technical Safeguards:

OpenAI provides multiple methods through which publishers may safeguard their data. Firstly, publishers may employ the Robots.txt protocol on their websites. This protocol instructs the crawlers that OpenAI uses to surf the web and collect training data for its LLMs not to access the site. Although concerns exist as to the effectiveness of the protocol, it is a primary safeguard against OpenAI and its operations.
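To make the Robots.txt safeguard concrete, the following is a minimal sketch using Python's standard `urllib.robotparser` module. `GPTBot` is the user-agent name OpenAI publishes for its crawler; the sample rules, the `is_allowed` helper and the `publisher.example` address are illustrative assumptions, not ANI's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve: block OpenAI's
# crawler (GPTBot) from the whole site, allow every other crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://publisher.example/news/story-1"  # placeholder URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", story))        # False
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", story))  # True
```

The check only shows what a compliant crawler would conclude from the file. The protocol is a request rather than an access control, which is why concerns about its effectiveness persist.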

Secondly, publishers must make sure that their content is protected from being reproduced on third-party websites. It is the duty of publishers to stay vigilant and clamp down on all unauthorized reproductions of their works, paywalled or otherwise. This reduces the risk that OpenAI will inadvertently use this data for training its models.

Publishers must also rely on the opt-out mechanism provided by OpenAI. This mechanism allows publishers to register themselves with OpenAI, which then blocks the publishers' URLs from its crawlers and training datasets.

Legal Safeguards:

Publishers must report any violation of their copyrighted content at the earliest. OpenAI specifically states in its terms of use that if any copyright owner finds its content being used, OpenAI must be notified. Thereafter, if the claim is genuine and ownership is proved, OpenAI will block the said URL from its training data. Lastly, if publishers aim to sue OpenAI or any similar platform, they must also implead the user whose query produced their content in order to prove copyright infringement.

How Can OpenAI Secure Its Interests?

In the present case, OpenAI has already secured its interests on multiple grounds. In addition to the legal arguments on transformative works and the application of Section 52 (supra), OpenAI has also sought to absolve itself of all liability through its terms of use and service agreement.

As per OpenAI's terms of use, the user is responsible for all the inputs it provides and all the outputs it receives. The user must ensure that it has all the rights and permissions needed to use the content it generates and that the use of such content is in consonance with the terms and all applicable laws. This naturally includes the copyright laws of the country. Furthermore, OpenAI assigns all rights in the generated content to the user and holds no ownership in it. Therefore, on these terms, OpenAI is neither the owner nor the user of the content it generates.

Interestingly, in light of these terms, the output being used to prove infringement belongs not to ChatGPT but to its users, yet no case against OpenAI has impleaded those users as parties. In addition, the terms require that all users hold the appropriate licenses and permissions needed to provide any input to ChatGPT, which further necessitates impleadment.

By operation of these terms, OpenAI is a mere research tool. It provides information as is, with a disclaimer that it may be wrong. It does not claim to be a fact-finding body or a journalistic agency. It advertises itself solely as a “human like query answerer” with no ability to express anything. How the user then uses this information is neither its concern nor its liability.

Lastly, in view of Section 52(1)(a), OpenAI must ensure that there is no commercial objective in generating content, and that this is specified in the terms and conditions. A defense under Sections 52(1)(g) and (m) may be used for training based on factual reporting. Aggregating content for general consumption on pertinent subjects, like quoting reports, can be permissible as fair use. The operation of Section 52(1)(a) will depend on the findings of the courts regarding the effect of ChatGPT on the relevant market of the publishers. The answer to this is yet to be seen, especially in light of the arguments made in the previous sections.

In conclusion, the intersection of generative AI models like ChatGPT and copyright law remains a highly complex and evolving domain, as evidenced by ongoing litigations and regulatory efforts globally and in India. While transformative and extractive use doctrines provide potential defenses under copyright law, their application varies significantly across jurisdictions. Publishers are actively seeking technical and legal safeguards to protect their content, while OpenAI relies on user-centric terms of use and the fair use doctrine to argue that its practices fall within permissible boundaries. Ultimately, the resolution of these disputes will depend on judicial interpretation of fair use, reproduction rights, and the market effects of AI outputs, shaping the future of AI training practices and intellectual property law.

Author: Sarthak K

In case of any queries, please contact or write back to us via email at chhavi@khuranaandkhurana.com or at Khurana & Khurana, Advocates and IP Attorneys.