AI Training Data under India’s DPDP Regime: Compliance Challenges and Strategies
Introduction
The foundation of AI systems is built on data collected from a wide range of sources, and the richness and size of a dataset are usually correlated with the performance of the model trained on it. Much of the data used to build these datasets contains Personally Identifiable Information (PII) embedded in text, photos, videos, and other digital media, and such data is likely to be used in large volumes when developing AI systems. As a result, the Digital Personal Data Protection Act, 2023 (“DPDP Act”) and its corresponding Rules govern how such PII can be collected, used, or processed by entities in India. The Act sets out obligations and duties for organisations and entities that collect, maintain, or process such data.
The major hurdle for organisations developing or using AI systems is adhering to these regulations while continuing to pursue innovative technology development. This article outlines the primary compliance issue, namely how the DPDP laws affect AI training data. It also examines practical compliance strategies that organisations can adopt to mitigate risk while striking the necessary balance between regulatory obligations and the development of innovative AI systems.
Why the DPDP Laws matter for AI Training Data
Any data pertaining to an identifiable individual that is processed digitally is subject to the DPDP Act, 2023. This covers both data gathered directly via internet channels and offline data that is subsequently digitised, as provided under Section 3 read with Section 2(1)(t) of the DPDP Act. This is especially important for AI systems because training datasets frequently contain personal information even when the dataset is not intended to identify individuals. There are instances in which personal information has been inadvertently incorporated into datasets used to train AI systems through text, audio, images, and other media.
Importantly, the DPDP Act does not exclude personal information just because it is accessible to the public. Depending on the nature and context in which the data is processed, information obtained via public websites or online platforms may nonetheless be considered personal data. Since they determine the purpose and means of processing training data, AI developers and organisations that deploy AI systems are usually considered “Data Fiduciaries” under the Act. Therefore, even where data is sourced from third parties or processed through external infrastructure, they are legally obligated to ensure compliance. Although the DPDP Act does not specifically provide for data processing by AI models, a reading of the definition of “automated” under Section 2(1)(b) makes clear that digital processing includes automated processing as well.[1]
Data protection is now a direct operational factor for AI training and development, particularly because the DPDP Rules further define how these duties must be carried out in practice.
Does publicly available data mean “free to use”?
In AI development, it is a prevalent fallacy that data accessible to the public is automatically exempt from data protection laws. This assumption is not supported by the DPDP Act. From a conjunctive reading of Section 3 and Section 2(1)(t), it is clear that all digitally processed data relating to an identifiable person is covered, and public access does not automatically legitimise all downstream uses of personal data, particularly where such data is aggregated or repurposed for AI training beyond the original context in which it was made public. Further, while Section 3(c)(ii) excludes personal data voluntarily made public by the data principal from the application of the Act, the extent of permissible downstream reuse, particularly for large-scale commercial AI training, remains legally unsettled.[2]
Data may still be considered personal data under the Act even if it is available on public websites or online platforms. The individual's reasonable expectations, the original context in which the data was disclosed, and the intended use of the data all remain pertinent. Because of this, scraping and using publicly accessible content, such as profile pictures, posts, or comments, to train AI models may raise compliance issues, especially if the data is used for commercial AI applications or for purposes other than the original context in which it was shared.[3]
The Research Exemption under Data Protection Regime
Subject to certain requirements prescribed under the DPDP Rules, Section 17(2)(b) of the DPDP Act permits some relaxations for processing personal data for research, archiving, or statistical purposes. This is commonly referred to as the research exemption. Crucially, this exemption is not automatic. It applies only where the protections outlined in the DPDP Rules and the Second Schedule thereto are adhered to, and where the processing does not lead to decisions being made regarding particular individuals. Rule 16, read with the Second Schedule, sets out in detail the standards for processing personal data necessary for research, archiving, or statistical purposes. Similar provisions may be found in the European General Data Protection Regulation (GDPR) under Article 5(1)(b) read with Article 89(1).
Therefore, developing an AI model solely for internal or academic study may be covered under the exemption. However, the processing might not be covered if the same information or model is later used for individual-level decision-making, profiling, or tailored recommendations. Thus, organizations that depend on this clause should exercise caution and keep precise records detailing:
(a) the reason for processing,
(b) the characteristics and extent of the dataset, and
(c) any limitations on commercial or downstream use.
Key compliance challenges for AI developers
(a) Identifying personal information in huge datasets
Large, complex, and unstructured datasets are frequently used for AI training. Personal information may be embedded in text, images, or audio files, or hidden within metadata and combinations of information that can be used to identify a person. Large-scale manual assessments are challenging, but failing to identify personal data increases exposure under the DPDP Act. Under the European GDPR, Article 35 provides for a Data Protection Impact Assessment (DPIA), which is required, among other cases, for large-scale processing of personal data and for systematic monitoring of publicly accessible areas. A DPIA forces early identification of personal data risks, evaluation of safeguards before deployment, and assessment of re-identification risk.
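To make the identification challenge concrete, the sketch below scans free-text records for a few common direct identifiers using regular expressions. It is a minimal illustration only: the patterns, names, and sample data are assumptions, and real datasets typically require dedicated PII-detection tooling and human review, not regexes alone.

```python
import re

# Simplistic, illustrative patterns; real PII detection must cover far more
# (names, addresses, images, metadata) and should use dedicated tooling.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone_in": re.compile(r"(?:\+91[\s-]?)?[6-9]\d{4}[\s-]?\d{5}"),  # Indian mobile-like
    "id_number": re.compile(r"\b\d{4}\s?\d{4}\s?\d{4}\b"),  # 12-digit ID-like strings
}

def scan_record(text: str) -> dict[str, list[str]]:
    """Return any suspected direct identifiers found in a text record."""
    return {
        label: matches
        for label, pattern in PII_PATTERNS.items()
        if (matches := pattern.findall(text))
    }

sample = "Contact Ravi at ravi.k@example.com or +91 98765 43210."
print(scan_record(sample))
# e.g. {'email': ['ravi.k@example.com'], 'phone_in': ['+91 98765 43210']}
```

A scan like this merely flags records for review or redaction before training; it does not by itself satisfy any DPDP obligation.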
(b) Consent at scale
Consent is a key component of the DPDP Act's legal justification for processing personal data. It might not be feasible for AI developers using third-party or web-scraped datasets to obtain each data principal's valid consent. In these situations, organisations must carefully consider whether another legal foundation exists and maintain transparent records outlining why consent was not needed or how it was obtained. Under the GDPR, Article 6 provides multiple lawful bases and risk-balancing mechanisms, so that consent may be dispensed with for large-scale processing where another basis applies, such as legitimate interests or the performance of a task in the public interest, with research processing further accommodated under Article 89.
(c) Vendor and cloud dependency
Multiple third parties are frequently involved in AI development, such as cloud service providers, external model providers, and annotation vendors. Even where processing is outsourced, the data fiduciary remains accountable for compliance under the DPDP Act. Therefore, inadequate contracts or weak vendor supervision may lead to direct accountability. The GDPR addresses vendor and cloud dependency by retaining accountability with the ‘data controller’ (the entity that determines the means and purpose of processing) and by enforcing mandatory processor contracts (Art. 28), processing only on the controller's instructions (Art. 29), and regulated cross-border transfers (Art. 44-49).
(d) Data security and breach risks
Training datasets frequently include sensitive and valuable information. Data fiduciaries must adhere to data breach notification obligations and put in place reasonable security measures as required under the DPDP Act. A security breach involving training data may result in legal action and serious reputational damage. The GDPR manages data security and breach risks by mandating risk-based security safeguards (Art. 32), imposing strict breach notification to the supervisory authority (Art. 33) and communication to affected data subjects (Art. 34), and extending security duties across controllers and processors (Art. 24 and 28).
Practical compliance strategies
(a) Map your data before you train
Organizations should establish a comprehensive data inventory identifying the data’s source, the kinds of personal information it contains, its intended use, and any applicable retention periods before utilizing any dataset for AI training. This approach minimizes uncertainty later on and aids in determining which responsibilities under the DPDP Act apply.
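As an illustration of what an inventory entry might capture, the sketch below uses a plain Python dataclass. The field names are assumptions drawn from the paragraph above, not a format prescribed by the DPDP Act or Rules.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetInventoryEntry:
    """One record in a pre-training data inventory (illustrative fields only)."""
    dataset_name: str
    source: str                                  # e.g. licensed corpus, internal logs
    personal_data_categories: list[str] = field(default_factory=list)
    processing_purpose: str = ""                 # why the dataset is used for training
    legal_basis: str = ""                        # e.g. consent, research exemption
    retention_until: date | None = None          # applicable retention period, if any

entry = DatasetInventoryEntry(
    dataset_name="support-chat-corpus-v2",       # hypothetical dataset
    source="internal customer support logs",
    personal_data_categories=["names", "email addresses"],
    processing_purpose="fine-tuning a support chatbot",
    legal_basis="consent recorded at account sign-up",
    retention_until=date(2027, 1, 1),
)
print(entry)
```

Even a simple register like this makes it far easier to answer, dataset by dataset, which DPDP obligations apply.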
(b) Follow data minimisation principles
AI developers should use only the data that is genuinely required for training. Direct identifiers such as names, email addresses, and phone numbers should be eliminated whenever feasible, along with excessive or redundant metadata. Methods such as pseudonymisation and anonymisation can further lower data protection risk while maintaining model utility.
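One common pseudonymisation technique is keyed hashing of direct identifiers, sketched minimally below. This illustrates the general approach rather than a complete anonymisation scheme: the key must be stored separately and securely, and quasi-identifiers left in a record can still enable re-identification.

```python
import hmac
import hashlib

# Whoever holds this key can re-link pseudonyms across records, so it must be
# generated randomly and kept separate from the training data.
PSEUDONYM_KEY = b"replace-with-a-securely-stored-random-key"

def pseudonymise(identifier: str) -> str:
    """Replace a direct identifier with a stable, non-reversible token."""
    digest = hmac.new(PSEUDONYM_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return "pid_" + digest.hexdigest()[:16]

record = {"email": "ravi.k@example.com", "message": "My order is late."}
record["email"] = pseudonymise(record["email"])
print(record)  # the email is replaced by a token such as 'pid_3f9c...'
```

Because the same identifier always maps to the same token, records about one person can still be linked for training purposes without exposing the identifier itself.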
(c) Be cautious when relying on exemptions
Where organisations or entities rely on the research exemption under the DPDP Act, access to training datasets should be limited and the use of data for individual-level decision-making should be prevented. Clear constraints on downstream use, recorded purpose statements, and internal approvals all contribute to a compliance record that can be defended.
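One lightweight technical control supporting this, sketched below with hypothetical names, is a purpose check that gates access to research-only datasets, so that a request for profiling or individual-level decision-making is refused by default.

```python
# Hypothetical purpose register; dataset and purpose names are illustrative.
ALLOWED_PURPOSES = {
    "support-chat-corpus-v2": {"academic_research", "internal_statistics"},
}

class PurposeNotPermitted(Exception):
    """Raised when a dataset is requested for a purpose outside its record."""

def authorise_access(dataset: str, declared_purpose: str) -> None:
    """Refuse access unless the declared purpose matches the recorded ones."""
    allowed = ALLOWED_PURPOSES.get(dataset, set())
    if declared_purpose not in allowed:
        raise PurposeNotPermitted(
            f"{dataset!r} may not be used for {declared_purpose!r}; "
            f"permitted purposes: {sorted(allowed)}"
        )

authorise_access("support-chat-corpus-v2", "academic_research")  # permitted
try:
    authorise_access("support-chat-corpus-v2", "individual_profiling")
except PurposeNotPermitted as err:
    print(err)  # refused; the refusal itself can be logged as evidence
```

A denied request logged this way also contributes to the defensible compliance record described above.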
(d) Strengthen vendor contracts
AI development frequently relies on third-party service providers. Contracts with cloud providers and data processors should include DPDP-aligned data protection obligations, confidentiality and security requirements, specified data breach reporting timescales, and audit or inspection rights. Strong contractual controls are essential to manage fiduciary obligations under the Act.
(e) Maintain documentation and transparency
Organizations should keep up-to-date dataset documentation and model cards that clearly explain data sources, known biases or limitations, and the compliance measures adopted. Internal governance, regulatory engagement, and investor or client due diligence increasingly depend on such documentation.
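A minimal illustration of such documentation, using assumed field names rather than any mandated template, is a structured dataset card kept alongside the data:

```python
import json

# Illustrative dataset card; the fields are assumptions, not a prescribed format.
dataset_card = {
    "name": "support-chat-corpus-v2",
    "version": "2.1",
    "sources": ["internal customer support logs (2023-2024)"],
    "personal_data": "names and email addresses pseudonymised before training",
    "known_limitations": ["skewed toward English-language queries"],
    "compliance_measures": [
        "consent recorded at account sign-up",
        "direct identifiers removed or pseudonymised",
        "access restricted to the model-training team",
    ],
    "last_reviewed": "2025-11-01",
}

# Store the card next to the dataset so reviewers and auditors can find it.
with open("dataset_card.json", "w", encoding="utf-8") as f:
    json.dump(dataset_card, f, indent=2)
```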
Looking ahead
The DPDP Act represents India's first comprehensive framework for personal data protection, but it should be seen as a beginning rather than a complete regulatory response to AI systems. AI-specific and sector-specific regulations are starting to emerge as the use of AI spreads throughout sectors such as finance, consumer technology, healthcare, and education. Legislative measures focused on AI that were presented in Parliament last year indicate the government's intention to move toward a more formal framework for regulating AI development and deployment. These initiatives are expected to reinforce the DPDP Act's fundamental tenets of accountability, transparency, risk assessment, and responsible data use.
Organizations will be better able to adjust to these changing regulatory requirements if they approach DPDP compliance as an early design and governance concern rather than a post-hoc legal activity. Integrating privacy and data protection measures into AI training and deployment procedures early also helps build the trust of users, regulators, and business partners, which is increasingly important in a data-driven, AI-enabled environment.
Responsible AI, accountability, and transparency are becoming increasingly important in India's developing AI governance framework. The DPDP Act's provisions work in tandem with these broader policy goals, even though its primary focus is on protecting personal data. As industry-specific AI legislation continues to evolve, organizations that incorporate privacy protections into AI design and training procedures now will be better equipped to adapt.
Conclusion
The DPDP Act, 2023 and its accompanying Rules mark a significant shift in how personal data used for AI training and deployment is regulated in India. For organisations developing or using AI systems, compliance is now a practical requirement that directly shapes the sourcing, processing, and governance of training data. Assumptions about public data, research exemptions, or outsourced processing are no longer adequate without thorough legal and operational assessment.
However, the DPDP regime does not aim to inhibit innovation. Rather, it encourages organisations to adopt responsible data practices, incorporate privacy safeguards at an early stage, and uphold accountability throughout the AI lifecycle, from training to deployment. Organisations can reduce regulatory risk while preserving innovation by proactively mapping data, minimising personal information, strengthening vendor oversight, and maintaining clear documentation.
As India’s AI governance framework continues to evolve alongside data protection law, DPDP compliance will increasingly serve as the baseline for future regulatory expectations. Organisations that align their AI strategies with these principles today will not only be better prepared for emerging AI-specific regulations but will also build greater trust with users, regulators, and business partners in an increasingly data-driven economy.
Author: Aditi Yadav. In case of any queries, please contact/write back to us via email at chhavi@khuranaandkhurana.com or at Khurana & Khurana, Advocates and IP Attorneys.
References
Ministry of Electronics and Information Technology, Digital Personal Data Protection Rules, 2025, notified under the Digital Personal Data Protection Act, 2023, https://www.meity.gov.in/static/uploads/2025/11/53450e6e5dc0bfa85ebd78686cadad39.pdf
Ministry of Electronics and Information Technology, Digital Personal Data Protection Act, 2023, Act No. 22 of 2023, Ministry of Law and Justice, Government of India, https://www.meity.gov.in/static/uploads/2024/06/2bf1f0e9f04e6fb4f8fef35e82c42aa5.pdf
India AI Governance Guidelines, Government guidance for responsible AI development and deployment, https://static.pib.gov.in/WriteReadData/specificdocs/documents/2025/nov/doc2025115685601.pdf
National Strategy for Artificial Intelligence, India's long-term policy vision for AI, NITI Aayog, https://www.niti.gov.in/sites/default/files/2023-03/National-Strategy-for-Artificial-Intelligence.pdf
General Data Protection Regulation (GDPR), Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016.
[1] Usha Tandon & Neeral Kumar Gupta, Informational Privacy in the Age of Artificial Intelligence: A Critical Analysis of India's DPDP Act, 2023, 6 Legal Issues in the Digital Age 87 (2025).
[2] Kshitij Malhotra, To Train or Not to Train: AI and the Data Privacy Dilemma, NLS L. & Tech. Forum (2025), available at https://forum.nls.ac.in/ijlt-blog-post/to-train-or-not-to-train-ai-and-the-data-privacy-dilemma/
[3] Hannah Ruschemeier, Generative AI and Data Protection, forthcoming in Carlo/Poncibo/Ebers/Zou (eds.), Handbook for Generative AI and the Law, Cambridge University Press (2025), available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4814999