
How to Use APDFL for the Text and Data Mining Reservation Protocol (TDMRep)


What is this?

 

Much of the content on the web is now fetched by machines to train generative AI and, more broadly, for Text and Data Mining (TDM) as a whole. The below diagram from a document on the matter explains this further:

 

A lot of this content comes from PDFs. The goal of this article (and its supporting documents and technologies) is to explain how to add machine-readable language to a PDF that serves one or more of the following use cases:

  1. Express rights reservation to TDM actors on the content of a PDF
  2. Provide information to TDM actors on how to legally obtain permission to fetch the content of a PDF (and compensate the rightsholder(s), if applicable)
  3. Opt out of the fetching of content by machines entirely

How does this work?

 

The TDM Reservation Protocol, or TDMRep, covers the use of an “opt-out” mechanism for TDM—specifically 1) whether specific content may be fetched and 2) how rightsholders can be contacted to obtain a license to fetch said content, if applicable. The latest version of the spec (as of this writing) is at: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/

 

There are two properties to TDMRep’s rights reservation model, as follows:

 

  1. tdm-reservation (boolean): indicates if mining rights are reserved or not.
  2. tdm-policy (URL): gives access to publishers’ contact information and conditions for obtaining authorization to mine content.

 

The latter is technically optional, but without it a TDM actor has no machine-readable way to find out how to obtain permission, which may make the reservation far less effective in practice.

 

These properties can be added to a PDF’s XMP metadata, either proactively or retroactively, by an author or publisher to express the reservation of rights and to provide a gateway for obtaining a license to acquire the content through text and data mining. See the below example from W3C:

 

Source: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/#example-8
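
As a rough illustration of what this metadata might look like, the Python sketch below embeds the two properties in a minimal XMP packet and reads them back using only the standard library. The packet is a simplified stand-in written for illustration, not the W3C example verbatim; the policy URL and the helper name read_tdm_properties are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs: RDF is the standard XMP container; TDM_NS is the TDMRep namespace.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
TDM_NS = "http://www.w3.org/ns/tdmrep/"

# Simplified, hypothetical XMP packet carrying the two TDMRep properties.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="" xmlns:tdm="http://www.w3.org/ns/tdmrep/">
      <tdm:reservation>1</tdm:reservation>
      <tdm:policy>https://example.com/policies/tdm-policy.json</tdm:policy>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_tdm_properties(xmp_text):
    """Return (reservation, policy) as strings, or None where a property is absent."""
    root = ET.fromstring(xmp_text)
    values = {"reservation": None, "policy": None}
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        for name in values:
            qualified = f"{{{TDM_NS}}}{name}"
            # Properties may appear as child elements ...
            element = description.find(qualified)
            if element is not None and element.text:
                values[name] = element.text.strip()
            # ... or as attributes on rdf:Description (the XMP shorthand form).
            elif description.get(qualified):
                values[name] = description.get(qualified)
    return values["reservation"], values["policy"]


print(read_tdm_properties(SAMPLE_XMP))
```

A reservation value of "1" means rights are reserved; the policy value is the URL a TDM actor would follow to learn how to obtain permission.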

 

What does a TDM Policy look like?

 

A TDM policy is written in JSON-LD (JSON for Linked Data) and uses standard terminology from ODRL (Open Digital Rights Language), a W3C policy expression language for describing how content may be used. Here’s a list of required properties for the JSON structure, followed by an example policy created by Datalogics and a link to one from W3C:

 

  1. @context: an array of two values, specifically “http://www.w3.org/ns/odrl.jsonld” and “http://www.w3.org/ns/tdmrep.jsonld”
  2. uid: an identifier for the policy, expressed as a URI. It is not required that this link references a publicly-accessible resource, but that is encouraged. W3C makes a simple recommendation to link to the website’s human-readable Terms of Use.
  3. @type: The value of this property must be “Offer”. Per the ODRL model, Offers are “proposals from Rights Holders for specific Rights over their Assets.”
  4. profile: The value of this property must be “http://www.w3.org/ns/tdmrep”.
  5. assigner: This property contains a set of child vCard properties for contacting the rightsholder. By our best interpretation of the spec, it is not required that all of these vCard properties are used, but if any are used, they must be within the list provided by section 7.1.4 of the TDMRep spec.
  6. permission: an array of permissions. By our best interpretation of the spec, this array has only one required element, “action”: “tdm:mine”, which expresses that the permission defined in this policy is to “analyse, via automated analytical technique, text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations”. The spec also lists a “duty” element, which expresses the TDM Actor’s duty to “obtain verifiable consent”. The spec doesn’t explicitly say that this element is required, but it seems important enough to warrant at least a strong recommendation.

 

Note that the spec lists other JSON properties under “recommended” and “optional”. The above list covers only the strictly required properties.

 

Here is a simple example created by Datalogics:

 

{
  "@context": [
    "http://www.w3.org/ns/odrl.jsonld",
    "http://www.w3.org/ns/tdmrep.jsonld"
  ],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://www.datalogics.com/terms-of-use",
  "assigner": {
    "uid": "https://www.datalogics.com/",
    "vcard:fn": "Datalogics",
    "vcard:hasAddress": {
      "vcard:street-address": "1207 Delaware Ave Suite 1810",
      "vcard:postal-code": "19806",
      "vcard:locality": "DE",
      "vcard:country-name": "USA"
    },
    "vcard:hasURL": "https://www.datalogics.com/datalogics-contact-us"
  },
  "permission": [
    {
      "action": "tdm:mine",
      "duty": [
        {
          "action": "obtainConsent"
        }
      ]
    }
  ]
}

An example by W3C can be found here: https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/#example-19; this example contains a structure for defining access as it pertains to scientific research, which can vary greatly by locale.
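
The required properties listed earlier can be checked mechanically before a policy is published. The Python sketch below is one possible way to do that; it mirrors this article's reading of the spec rather than performing full JSON-LD or ODRL validation, and the function name check_required_properties and the example.com policy are hypothetical.

```python
import json

ODRL_CONTEXT = "http://www.w3.org/ns/odrl.jsonld"
TDMREP_CONTEXT = "http://www.w3.org/ns/tdmrep.jsonld"
TDMREP_PROFILE = "http://www.w3.org/ns/tdmrep"


def check_required_properties(policy):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []

    context = policy.get("@context")
    if not isinstance(context, list) or not all(
        uri in context for uri in (ODRL_CONTEXT, TDMREP_CONTEXT)
    ):
        problems.append("@context must list the ODRL and TDMRep context URLs")

    uid = policy.get("uid")
    if not isinstance(uid, str) or not uid:
        problems.append("uid must be a non-empty URI string")

    if policy.get("@type") != "Offer":
        problems.append('@type must be "Offer"')

    if policy.get("profile") != TDMREP_PROFILE:
        problems.append(f'profile must be "{TDMREP_PROFILE}"')

    if not isinstance(policy.get("assigner"), dict):
        problems.append("assigner must be an object of vCard properties")

    permissions = policy.get("permission")
    if not isinstance(permissions, list) or not any(
        isinstance(p, dict) and p.get("action") == "tdm:mine" for p in permissions
    ):
        problems.append('permission must include an entry with "action": "tdm:mine"')

    return problems


# Hypothetical minimal policy used only to exercise the checks.
EXAMPLE_POLICY = json.loads("""
{
  "@context": ["http://www.w3.org/ns/odrl.jsonld", "http://www.w3.org/ns/tdmrep.jsonld"],
  "@type": "Offer",
  "profile": "http://www.w3.org/ns/tdmrep",
  "uid": "https://example.com/terms-of-use",
  "assigner": {"vcard:fn": "Example Publisher"},
  "permission": [{"action": "tdm:mine"}]
}
""")

print(check_required_properties(EXAMPLE_POLICY))
```

Returning a list of problems rather than raising on the first failure lets a publisher see everything that needs fixing in one pass.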

 

How can we use APDFL to express TDM rights reservation in a PDF?

 

For this initial example, we will use APDFL’s .NET interface. To summarize: we need to modify the XMP metadata to express the properties “reservation” and “policy” under the namespace “http://www.w3.org/ns/tdmrep/”. APDFL’s Document class has a method for setting an XMP metadata property, aptly named SetXMPMetadataProperty(). That might look like this:

 

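// The TDMRep namespace named above
const string tdmNS = "http://www.w3.org/ns/tdmrep/";

// "1" reserves TDM rights; the policy URL below is a placeholder
doc.SetXMPMetadataProperty(tdmNS, "tdm", "reservation", "1");
doc.SetXMPMetadataProperty(tdmNS, "tdm", "policy", "http://www.test.com/some_tdm_policy");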

 

Here, “doc” is a Document object, “tdmNS” is shorthand for the aforementioned W3C namespace, and “tdm” is the property’s prefix in XML. This should add XMP metadata properties to the PDF similar in appearance to those in the W3C XML example referenced earlier in this article.

 

Datalogics has a sample application written in .NET that can do the above—you can find it here: https://github.com/datalogics/additional-apdfl-csharp-dotnet-samples

 

Note that APDFL’s Java and C++ interfaces have equivalent functions: Document.setXMPMetadataProperty() and PDDocSetXAPMetadataProperty(), respectively.

 

What if I want to do the scraping? Can APDFL do this?

 

Yes. In addition to the sample application above, Datalogics offers a sample application that extracts text from a PDF and prints any TDMRep metadata to the console. It does this by checking for the relevant properties in the PDF’s XMP metadata and, if they exist, using a JSON parser to return the contact information of the rightsholder. That application can be found at the same GitHub link above.
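
The reading side of that flow can be sketched in a few lines of Python. This is not the Datalogics sample; it is a minimal illustration that assumes the reservation value has already been read from the XMP metadata and that the policy JSON has already been downloaded from the policy URL. The function name describe_reservation and the example.com values are hypothetical.

```python
import json


def describe_reservation(reservation, policy_json=None):
    """Summarize TDMRep metadata values as text suitable for console output."""
    if reservation is None:
        return "No TDMRep metadata found."
    if reservation != "1":
        return "TDM rights are not reserved."

    lines = ["TDM rights are reserved."]
    if policy_json is None:
        lines.append("No policy supplied; no contact information is available.")
        return "\n".join(lines)

    assigner = json.loads(policy_json).get("assigner", {})
    if not isinstance(assigner, dict):
        # An assigner given only as a URI string carries no vCard details.
        assigner = {}
    for key, label in (("vcard:fn", "Rightsholder"), ("vcard:hasURL", "Contact URL")):
        if key in assigner:
            lines.append(f"{label}: {assigner[key]}")
    return "\n".join(lines)


# Hypothetical policy document, as it might be returned from the policy URL.
POLICY = json.dumps({
    "assigner": {
        "vcard:fn": "Example Publisher",
        "vcard:hasURL": "https://example.com/contact",
    }
})

print(describe_reservation("1", POLICY))
```

A scraper that respects TDMRep would run a check like this before mining the extracted text, and stop to seek permission whenever rights are reserved.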
