Centre for Digital Public Infrastructure

Non-personal Anonymised Datasets

Guidelines for Decision Making & Research



Background

Publicly available anonymised datasets are collections of data that have undergone a process of data anonymisation, which preserves the analytical and research value of the data while maintaining the anonymity of the data subjects. The purpose of this process is to protect individuals' privacy by removing personally identifiable information (PII), such as names, addresses, and social security numbers, while still making available data containing important insights for policy, administrative, research or trend assessments across various sectors. Non-personal data includes the above as well as data that had no personal information to begin with (such as public GIS locations, regional/national socio-economic indicators, weather data or aggregated tax collections).
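As a minimal illustration of this idea, the sketch below drops direct identifiers, coarsens quasi-identifiers, and suppresses groups that are too small to publish safely. The field names, generalisation rule and the k=5 threshold are assumptions for illustration, not part of any specific system.

```python
# Minimal sketch: anonymising tabular records before publication.
# Field names, the generalisation rule and the k=5 threshold are
# illustrative assumptions, not a prescribed standard.
from collections import Counter

PII_FIELDS = {"name", "address", "national_id"}   # direct identifiers to drop
QUASI_IDENTIFIERS = ("age_band", "district")      # fields used for grouping

def generalise(record: dict) -> dict:
    """Remove PII fields and coarsen quasi-identifiers (exact age -> age band)."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    decade = (record["age"] // 10) * 10
    cleaned["age_band"] = f"{decade}-{decade + 9}"
    cleaned.pop("age", None)
    return cleaned

def anonymise(records: list[dict], k: int = 5) -> list[dict]:
    """Keep only records whose quasi-identifier combination occurs at least
    k times, so small groups cannot be singled out (a simple k-anonymity
    style check)."""
    cleaned = [generalise(r) for r in records]
    counts = Counter(tuple(r[q] for q in QUASI_IDENTIFIERS) for r in cleaned)
    return [r for r in cleaned if counts[tuple(r[q] for q in QUASI_IDENTIFIERS)] >= k]
```

The small-group suppression step is one simple way to address the risk, noted under the design principles below, that small datasets can be de-anonymised when aggregated results are shared.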

These datasets can be made freely available to the public to encourage innovation, promote transparency, or assist in scientific research. For example, public anonymised datasets can be useful in training machine learning (ML) models, analysing aggregated health data to understand disease patterns, devising informed care plans, and designing clinical trials for drug development. They facilitate algorithm training without privacy breaches, inform healthcare strategies, optimise trial protocols, and expedite drug discovery while upholding privacy and ethical standards.

It is important to note that a centralised approach to aggregated datasets may be difficult to scale, since every entity would be required to upload its datasets onto a single platform, and many may be hesitant to part with their data. Even if they do share their data, ensuring it is regularly updated and synced to the central system would be an uphill task. A simpler approach is to create an open network policy for anonymised data sharing: entities can engage any technology provider and join the network to share their data under their own brand. This gives them recognition and control over their own datasets and makes it easier to keep the data updated over the long term. Whether the data is made freely available or offered at a cost can be a policy decision; the network would support both models.

Design Principles for crafting open data sets

  1. Federation by design: Rather than striving for a single centralised data repository covering all relevant data for the sector, it may be more pragmatic to foster an ecosystem where multiple such datasets and data providers exist (even across multiple portals/platforms), each contributing to a broader pool of knowledge available to multiple innovators. Harmonising the data schemas across all providers is not necessarily required, as long as each data-publishing entity publishes the schema used by its dataset.

  2. Privacy by design to protect individual identity at all times: Small data sets require special attention when sharing aggregated results to ensure de-anonymisation is not possible.

  3. Open Access: It is key to ensure that each dataset is made available openly for others to leverage and reuse effectively through transparent policies.

  4. Open Standards: Promoting open standards for data sharing is key to enabling ease of reuse by software algorithms that access and analyse data. Open data schemas and APIs facilitate seamless access to data from multiple sources (a sketch of a published catalogue entry with its schema follows this list).
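To make the federation and open-standards principles concrete, the sketch below shows one possible shape of a catalogue entry in which a publisher describes its own dataset and data schema. The structure, field names, publisher name and URL are assumptions for illustration, not a mandated format.

```python
# Hypothetical catalogue entry published by one data provider in a
# federated network. The structure and field names are illustrative only.
import json

catalogue_entry = {
    "dataset_id": "rainfall-district-monthly",
    "publisher": "Example State Water Department",   # hypothetical publisher
    "licence": "CC-BY-4.0",                          # open-access licence
    "access": {
        "type": "api",                               # could also be "download" or "sandbox"
        "endpoint": "https://data.example.org/v1/rainfall",  # placeholder URL
        "pricing": "free",
    },
    # Each publisher declares its own schema; harmonisation across publishers
    # is not required as long as the schema itself is published openly.
    "schema": {
        "district_code": {"type": "string", "description": "administrative district code"},
        "month": {"type": "string", "format": "YYYY-MM"},
        "rainfall_mm": {"type": "number", "unit": "millimetres"},
    },
}

print(json.dumps(catalogue_entry, indent=2))
```

Because each entry carries its own openly published schema, consumers can discover and interpret datasets from many providers without requiring a single harmonised data model.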

Decentralised non-personal data network

Accessing data can itself be treated as a set of services facilitated by APIs, following a standardised protocol for β€˜discovery and fulfilment’ of any good or service. A decentralised β€œnon-personal data access network”, designed to facilitate access based on unified standards via a protocol such as Beckn (see Reference Examples below), can enable the following data access services through standard APIs (a minimal client sketch follows this list):

1. Discovery of different types of datasets across agencies/entities (public and private).

2. Licensing/Contracting: dataset licence conditions vary, so both parties should enter into a contract before download/access.

3. Download/Access: Access methods range from dataset downloads to data-as-a-service models employing confidential computing. Rather than making raw data directly available, advanced approaches may establish data sandboxes in which deep learning models can be trained in place.

4. Pricing: While some data may be freely accessible, not all datasets are public or free. Public data is typically expected to be freely accessible, though value-added analytics on it may be offered at a price. Transparent pricing enables all players to make informed decisions and sustains the ecosystem.

5. Update/Feedback cycles: Implementing automatic dataset update notifications ensures timely feedback cycles (for example, yearly), enhancing data relevance over time.
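The sketch below shows how a data consumer might exercise these five services against such a network. The gateway URL, endpoint paths, payloads and client behaviour are assumptions about a hypothetical implementation, not a published specification.

```python
# Hypothetical client-side flow over a decentralised non-personal data network.
# Endpoint names and payloads are illustrative assumptions only.
import requests  # third-party HTTP client, used here for illustration

GATEWAY = "https://npd-gateway.example.org"  # placeholder network gateway

def discover(keyword: str) -> list[dict]:
    """1. Discovery: search dataset catalogues published by many providers."""
    return requests.get(f"{GATEWAY}/search", params={"q": keyword}).json()["datasets"]

def contract(dataset_id: str, consumer_id: str) -> dict:
    """2. Licensing/Contracting: accept the publisher's licence terms before access."""
    return requests.post(f"{GATEWAY}/contracts",
                         json={"dataset_id": dataset_id, "consumer": consumer_id}).json()

def fetch(contract_token: str) -> bytes:
    """3. Download/Access: retrieve the data (or a sandbox handle instead of raw data)."""
    return requests.get(f"{GATEWAY}/access",
                        headers={"Authorization": contract_token}).content

def price(dataset_id: str) -> dict:
    """4. Pricing: retrieve the published price (which may simply be 'free')."""
    return requests.get(f"{GATEWAY}/datasets/{dataset_id}/pricing").json()

def subscribe_updates(dataset_id: str, callback_url: str) -> dict:
    """5. Update/Feedback: register a callback for new dataset versions."""
    return requests.post(f"{GATEWAY}/subscriptions",
                         json={"dataset_id": dataset_id, "callback": callback_url}).json()
```

In a fully decentralised deployment, each provider would publish its own catalogue and the gateway role could itself be federated; the single-gateway client above is only a simplification.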

All of these processes occur across diverse agencies publishing data product/service catalogues within a decentralised network. This is crucial for ensuring that both public and private datasets are accessible within a unified decentralised framework.

The aforementioned architecture pertains to non-personal data (NPD) across public and private systems, whether through download/access or confidential computing models. All operations are decentralised, allowing data set services to be managed and updated by agencies worldwide.

Reference Examples

  • Beckn – an open protocol for decentralised discovery and fulfilment.
  • Language training data sets – for Indian local language models, for training and benchmarking.

Note:

Personal data sharing is not in the scope of this document. Personal and non-personal data sharing require different approaches across architecture, policy and governance frameworks. Consent to opt in to sharing data for anonymisation is assumed to be part of the data-sharing governance/policies of the source systems/platforms/entities and is outside the scope of this document.