Refined Annotation Leads to Precise Diagnostics: Boosting Pathology Dataset with Codatta’s Royalty Model

Codatta
5 min read · Dec 20, 2024

Introduction

The Enhanced Gleason Grading Annotations for the TCGA PRAD Dataset represent a collaboration between Codatta and DPath.ai, setting a new standard for AI-ready pathology data. By engaging an elite community of expert pathologists through Codatta’s platform, the dataset advances beyond traditional slide-level labels, introducing Region of Interest (ROI)-level spatial annotations that enhance diagnostic granularity, accuracy, and transparency. With refined Gleason grades, detailed reasoning, and ROI-based mapping of Gleason patterns, this dataset serves as a critical resource for AI model development and pathology research, addressing key challenges in creating high-quality annotated data. Through Codatta’s royalty-based model, contributors maintain ownership of their work, ensuring recognition and ongoing value as the dataset gains traction, while DPath.ai demonstrates how collaborative solutions can drive advancements in pathology AI.

Figure 1: Enhanced Gleason grading annotations for the TCGA PRAD prostate cancer dataset. Source: https://huggingface.co/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset

What is the TCGA PRAD Dataset?

The enhanced Gleason grading annotations for the TCGA PRAD (The Cancer Genome Atlas Prostate Adenocarcinoma) dataset upgrade the original slide-level labels by adding Region of Interest (ROI)-level spatial annotations. Developed collaboratively by Codatta and DPath.ai, the annotations were produced by a community of pathologists on a platform that accepts contributions from around the world while letting annotators retain ownership of their work. This approach enhances diagnostic accuracy, granularity, and reliability, all of which are critical for AI model training and pathology research.

Working from 435 TCGA whole slide images, pathologists identified 245 cases whose labels required improvement and confirmed the labels of the remaining 190 cases as accurate. The dataset includes slide-level metadata and ROI-level spatial annotations, giving researchers material for AI pipeline development, interactive tumor region exploration, and advanced pathology research.
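To make the two annotation levels concrete, here is a minimal Python sketch of how a slide-level record with ROI-level annotations could be modeled, and how the review outcome above can be tallied. The class and field names (`RoiAnnotation`, `SlideRecord`, `label_revised`, and so on) are illustrative assumptions and not the schema of the published dataset; only the counts 435, 245, and 190 come from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class RoiAnnotation:
    """One region of interest on a slide (hypothetical fields)."""
    polygon: list[tuple[float, float]]  # ROI outline in slide pixel coordinates
    gleason_pattern: int                # Gleason pattern assigned to this region
    reasoning: str = ""                 # pathologist's free-text justification


@dataclass
class SlideRecord:
    """Slide-level metadata plus its ROI annotations (hypothetical fields)."""
    slide_id: str
    original_grade: str                 # label shipped with the original dataset
    refined_grade: str                  # label after expert review
    rois: list[RoiAnnotation] = field(default_factory=list)

    @property
    def label_revised(self) -> bool:
        # A slide counts as "revised" when review changed its label.
        return self.original_grade != self.refined_grade


def review_summary(revised: int, confirmed: int) -> dict:
    """Tally revised vs. confirmed slides as counts and percentages."""
    total = revised + confirmed
    return {
        "total": total,
        "revised_pct": round(100 * revised / total, 1),
        "confirmed_pct": round(100 * confirmed / total, 1),
    }


# Counts reported for the refined TCGA PRAD annotations.
summary = review_summary(revised=245, confirmed=190)
```

With the reported counts, the revised and confirmed slides add up to the full set of 435, and a little over half of the original labels were changed during review.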

Empowering Pathology AI: Codatta and DPath.ai Join Forces

The Enhanced Gleason Grading Annotations for the TCGA PRAD Dataset exemplify the potential of collaborative, community-driven data creation. By improving label accuracy and granularity, they enable more reliable AI model training and advance medical research. However, such contributions require domain expertise, time, and effort, which highlights the need for a sustainable incentive structure that recognizes and rewards skilled practitioners for their work.

Royalty Model

This is where Codatta’s royalty-based model comes into play. It changes how data is contributed and accessed compared to traditional Web2 models like Scale AI. Scale AI’s model suits generalists who prefer immediate payment, which enables rapid and cost-efficient collection of large-scale data, but it becomes expensive when domain experts are needed for specialized tasks, pricing out smaller players. Codatta, on the other hand, aligns with skilled practitioners and experts by offering conditional and asset-based rewards. As shown in Figure 2 below, these incentives appeal to contributors willing to invest in high-quality, specialized data for delayed yet potentially higher returns, making Codatta a strong fit for Vertical AI and advanced applications requiring precision and expertise.

Figure 2: Mapping of Skill Proficiency to Liquidity Preferences in Data Contribution

Unlike Scale AI’s high upfront costs, Codatta’s royalty model eliminates financial barriers for smaller AI startups by introducing a pay-as-you-access system. This approach democratizes access to critical frontier data without requiring costly upfront investments, allowing startups to showcase their product-market fit and scale. Additionally, by transforming data into liquid assets within a decentralized financial market, Codatta ensures contributors can balance short-term liquidity needs with long-term asset ownership. Features like stipulative trading and fractional ownership further enhance liquidity, making asset-based rewards viable and attractive for a broader range of contributors. This alignment fosters collaboration, fuels innovation in niche AI applications, and creates a diversified investment ecosystem for data creators and startups alike.
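The article does not specify how royalties are calculated, so the following is only a sketch of one way a pay-as-you-access fee could be divided among contributors who hold fractional ownership of a dataset. The function name `split_access_fee`, the share units, and the rounding rule are all assumptions made for illustration and do not describe Codatta's actual protocol.

```python
from fractions import Fraction


def split_access_fee(fee_cents: int, shares: dict[str, int]) -> dict[str, int]:
    """Split one access fee pro rata by ownership share, in whole cents.

    Hypothetical rule: each owner first gets their share rounded down, then
    any leftover cents go to the owners with the largest fractional
    remainders, so the payouts always sum exactly to the fee.
    """
    total_shares = sum(shares.values())
    if total_shares <= 0:
        raise ValueError("shares must sum to a positive number")

    # Exact pro-rata amounts, kept as fractions to avoid float rounding.
    exact = {o: Fraction(fee_cents * s, total_shares) for o, s in shares.items()}
    payouts = {o: int(amount) for o, amount in exact.items()}  # round down

    # Hand out the cents lost to rounding, largest remainder first.
    leftover = fee_cents - sum(payouts.values())
    by_remainder = sorted(exact, key=lambda o: exact[o] - payouts[o], reverse=True)
    for owner in by_remainder[:leftover]:
        payouts[owner] += 1
    return payouts


# A 10.00 access fee split across three hypothetical contributors.
payouts = split_access_fee(
    1000, {"pathologist_a": 50, "pathologist_b": 30, "reviewer": 20}
)
```

In this sketch a startup pays only when it accesses the data, and each payment flows to the current share holders, which is the general idea behind asset-based rewards described above.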

DPath.ai: A Collaborative Solution for Pathology AI Data Challenges

DPath.ai is pioneering a decentralized platform designed to connect pathologists, researchers, and AI model developers globally. It sources, curates, and exchanges high-quality pathology data, empowering anyone interested in training their own AI models. The DPath platform leverages blockchain technology to ensure transparency, fairness, and secure data exchange.

Platforms like DPath.ai can leverage Codatta’s decentralized data protocol to source annotations collaboratively and transparently:

  • Task Definition: Clear annotation standards, such as Gleason grading for prostate cancer, ensure consistency and reliability in the resulting dataset.
  • Community Engagement: Skilled pathologists worldwide participate through Codatta’s platform, incentivized by its royalty-based model, earning ongoing rewards linked to the dataset’s future value.
  • Quality and Integrity: Blockchain-based verification and multi-party cross-referencing ensure traceable, high-quality annotations while enhancing the accountability of annotators.
  • Security and Accessibility: With the data stored in a decentralized manner, data ownership remains secure and accessible to the individuals involved.

Figure 3: Collaboration of Codatta and DPath.ai. Source: https://huggingface.co/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset
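As an illustration of the multi-party cross-referencing mentioned in the list above, the sketch below accepts a Gleason pattern for an ROI only when enough independent annotators agree on it. The function name `consensus_pattern` and the two-thirds threshold are assumptions chosen for the example; the article does not describe the actual verification rule.

```python
from collections import Counter
from typing import Optional


def consensus_pattern(
    votes: dict[str, int], min_agreement: float = 2 / 3
) -> Optional[int]:
    """Return the agreed Gleason pattern for one ROI, or None if unresolved.

    `votes` maps an annotator ID to the pattern that annotator assigned.
    The most common pattern is accepted only if its share of the votes
    reaches `min_agreement` (a hypothetical threshold).
    """
    if not votes:
        return None
    pattern, count = Counter(votes.values()).most_common(1)[0]
    return pattern if count / len(votes) >= min_agreement else None


# Two of three annotators agree, so the ROI is accepted as pattern 4.
agreed = consensus_pattern({"annotator_1": 4, "annotator_2": 4, "annotator_3": 3})

# A three-way split stays unresolved and would need further review.
unresolved = consensus_pattern({"annotator_1": 3, "annotator_2": 4, "annotator_3": 5})
```

Keeping the annotator IDs alongside each vote is what makes the result traceable: a disputed ROI can be routed back to the specific reviewers who disagreed.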

By sourcing domain-specific data through collaboration, DPath.ai not only enriched the TCGA PRAD dataset with precise Gleason grading but also demonstrated how Codatta’s platform enables frontier data creation for specialized AI domains. This approach fosters sustainable participation, democratizes data access, and accelerates the development of equitable, high-performing healthcare AI systems.

Bringing It All Together

The Enhanced Gleason Grading Annotations for the TCGA PRAD Dataset, a collaboration between Codatta and DPath.ai, improve the diagnostic accuracy and granularity of pathology AI data by adding ROI-level annotations accompanied by the pathologists’ reasoning. By engaging expert pathologists globally, the project ensures high-quality data while rewarding contributors through Codatta’s royalty-based model, which offers ongoing value and ownership. This approach also fosters collaboration, improves data liquidity, and accelerates advancements in healthcare AI, showcasing the power of decentralized, community-driven solutions.

References

  1. Codatta. (2024). Refined TCGA PRAD Prostate Cancer Pathology Dataset. Hugging Face. Retrieved from https://huggingface.co/datasets/Codatta/Refined-TCGA-PRAD-Prostate-Cancer-Pathology-Dataset
  2. Codatta. (2024). Revolutionizing Pathology Image Annotation: A Paradigm Shift to a Universal Labeling and Annotation Framework. Retrieved from https://codatta.medium.com/revolutionizing-pathology-image-annotation-a-paradigm-shift-to-a-universal-labeling-and-annotation-3f80b75c7b3f
  3. National Cancer Institute. (2024). TCGA-PRAD Collection. The Cancer Imaging Archive. Retrieved from https://www.cancerimagingarchive.net/collection/tcga-prad/

Meet Codatta

Codatta is a permissionless marketplace connecting data creators with demanders to curate valuable data resources, assetified on the XnY network. These assets fuel AI and DeSci projects with a royalty model that enables revenue sharing with creators.
