Web Scraping and AI Training Data: Copyright Challenges in the Age of Generative AI

Apr 29

Introduction

With the rapid development of generative artificial intelligence (AI) and the surge in its use, copyright concerns about web-scraped training data have become evident and pressing. These AI systems rely heavily on datasets that are often built with web scraping tools and that include countless articles, blogs, published works, and images protected under copyright law. In the context of intellectual property (IP), data scraping refers to the extraction of copyrighted or proprietary content from websites or databases using bots, here for the purpose of training an AI system, and it may carry IP infringement risks. The pressing legal question is: “Does the scraping of copyrighted material to train AI systems constitute copyright infringement under the Copyright Act, 1957?”

Web Scraping and Copyright Law in India

The Copyright Act, 1957 protects literary, artistic and cinematographic works. According to the existing jurisprudence, Indian law is silent and uncertain on the issue of data scraping.^[1] Under Section 14 of the Copyright Act, copyright holders possess the exclusive rights to reproduce the work, store it electronically and communicate it to the public.^[2] In light of this right, the question of whether copying works for AI training qualifies as “reproduction” under copyright law becomes of paramount importance. As noted above, Indian law remains silent on the point, yet unauthorised copying and republishing may invite legal action. Besides, the doctrine of fair dealing, as codified under Section 52^[3] of the Act, permits the use of copyrighted works for purposes such as research, private study, criticism, and review. It does not expressly cover large-scale commercial scraping conducted for training AI models, which may therefore fall outside the scope of the exception. Data scraping may also attract legal scrutiny in certain circumstances, such as where it bypasses technological protection measures, takes substantial portions of copyrighted works, or extracts the contents of proprietary databases.

Another problem is the nature of the copying that occurs when AI systems are trained. Machine learning systems, for instance, make temporary copies of the information they are exposed to during the training process. If such copies are considered “reproduction” under copyright law, AI training could infringe copyright unless it is covered by an exception. Indian copyright law does not clearly address whether temporary copies constitute actionable copying. This uncertainty further complicates the application of traditional copyright doctrines to emerging data-driven technologies.

Global Litigation and Emerging Legal Debate

Globally, disputes over AI datasets have reached the courts. One prominent case is NYT v. OpenAI,^[4] in which the newspaper claims that millions of its articles were used without permission to train generative AI models. The issue is whether such copying amounts to infringement under copyright law, especially when it serves a transformative technological use. Commentators have noted, however, that the newspaper's current position contrasts with the one it took in earlier litigation concerning the legality of digitizing articles for online databases, which underscores how the legal issues shift with each new technological use of copyrighted works.^[5] The competing arguments of copyright holders and AI developers centre on transformative use and on whether training involves “reproduction” of copyrighted works, with AI developers arguing that the models learn patterns rather than the works themselves. Courts worldwide are struggling to find a single clear answer because copyright law was written long before these technologies emerged, and the bare text of the law cannot easily accommodate this Techno-IP dichotomy. Moreover, even if training requires copying, the final output is generally not identical to the source, which blurs the boundary between analysis and reproduction.

Temporary Copies and the Problem of AI Training

While the Copyright Act, 1957 provides a general framework for protecting creative works, it does not explicitly address the legality of using copyrighted material for machine-learning training purposes. Copyright jurisprudence has faced somewhat similar issues before in the context of digital copying. The most notable example was the dispute over Google’s book digitization program, in which concerns were raised about the scanning of copyrighted works to create searchable databases.^[6] Although the technological context differed from modern AI systems, the controversy similarly revolved around whether copying copyrighted material for technological processing constituted infringement or a permissible use. This shows how emerging technologies repeatedly test the limits of traditional copyright doctrines.

The growing reliance on data-driven technologies also points to a widening gap between technological practice and its regulation. On one hand, the lack of regulation surrounding the scraping of copyrighted materials could threaten the economic rights of copyright holders. On the other hand, over-regulation of the use of copyrighted materials might curb the development of new technologies and artificial intelligence systems. Resolving this tension may require clearer rules on how copyrighted materials may be used.

Comparative Legislative Approaches

A number of jurisdictions have already begun amending their copyright laws to address the use of copyrighted works in the training datasets of artificial intelligence systems. For example, the EU Copyright Directive, officially Directive (EU) 2019/790, provides text and data mining exceptions that permit the automated analysis of copyrighted works, subject to certain conditions. Japan's Copyright Act likewise permits the use of copyrighted works for data analysis, which reflects a liberal attitude towards machine learning technology. These are possible routes that India might follow in dealing with the challenges that artificial intelligence training datasets pose under the Copyright Act, 1957.

Policy Challenges and Future Regulation

In the future, policymakers and the courts may need to find ways to strike an appropriate balance between the interests of creators and technology developers. In light of these developments, India might benefit from a text and data mining exception similar to the one provided under the EU Copyright Directive, Directive (EU) 2019/790. Such an exception would permit data mining of copyrighted materials for artificial intelligence training, subject to conditions that safeguard the rights of copyright holders. These conditions might include allowing copyright holders to reserve their rights and opt out of data mining through technological means. AI developers would then have access to the large amounts of data that machine learning requires, without copyright holders losing control over how their work is used for commercial purposes. This would go some way towards resolving the current uncertainty surrounding AI training data and would provide a more equitable balance between technology and intellectual property rights under Indian copyright law.
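As an illustration only, the idea of reserving rights through technological means can be sketched in code. The example below assumes that a website signals its opt-out through a robots.txt file, which is one commonly discussed convention and not something that this article, the Copyright Act, 1957 or the Directive prescribes. The crawler name ExampleAIBot, the example.com URL and the helper may_fetch are hypothetical, and whether honouring such a signal would satisfy any legal opt-out requirement remains an open question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt published by a rights holder: it blocks one
# (made-up) AI training crawler and allows every other crawler.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"
    print(may_fetch(ROBOTS_TXT, "ExampleAIBot", page))  # opted-out crawler
    print(may_fetch(ROBOTS_TXT, "OtherBot", page))      # any other crawler
```

A scraper built along these lines would skip any page for which the check fails, leaving the choice to opt out with the rights holder rather than the developer.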

Conclusion

As generative AI tools increasingly depend on web-scraped data, a fundamental conflict within copyright legislation becomes evident. Although the Copyright Act, 1957 gives copyright holders exclusive rights and control over the reproduction and use of their works, it does not specifically address whether processing copyrighted content for machine learning purposes constitutes actionable copying. This legal ambiguity has become more significant as AI developers increasingly rely on large sets of publicly accessible web content to support their applications. At the same time, unrestricted web scraping threatens to erode copyright holders' economic and moral rights. As recent litigation and legislative developments worldwide indicate, existing copyright legislation is being tested by data-driven technologies. For India, the key challenge appears to be creating a legal framework that balances technological advancement with meaningful copyright protection. Defining the limits of lawful data mining and AI training within Indian copyright legislation may thus emerge as a critical test of how well intellectual property law meets the requirements of a modern, technology-driven world.

Author: Moomal Joshi. In case of any queries, please contact or write back to us via email at chhavi@khuranaandkhurana.com or at Khurana & Khurana, Advocates and IP Attorneys.

^[1] Ada Shaharbanu and Sean McDonald, ‘Legality of Data Scraping Under Indian Law’ (Spice Route Legal, 1 August 2025) https://spiceroutelegal.com/publications/legality-of-data-scraping-under-indian-law/ accessed 6 March 2026.

^[2] Copyright Act 1957, s 14.

^[3] Copyright Act 1957, s 52.

^[4] NYT v. OpenAI, No. 1:23-cv-11195 (S.D.N.Y.).

^[5] ‘NYT v. OpenAI: The Times’s About-Face’ (Harvard Law Review Blog, April 2024) https://harvardlawreview.org/blog/2024/04/nyt-v-openai-the-timess-about-face/.

^[6] Jonathan Band, ‘The Google Library Project: Both Sides of the Story’ (2006) Plagiary: Cross-Disciplinary Studies in Plagiarism, Fabrication, and Falsification https://quod.lib.umich.edu/p/plag/5240451.0001.002 accessed 6 March 2026.