MMIF and PBCore

Some notes on the mappings of elements from the MMIF file in the “everything and the kitchen sink” example to PBCore (see index.md and raw.json).

The relevant information that we have in MMIF is in the following types:

  1. Instances of TimeFrame with frameType “bars-and-tone” or “slate”. These directly refer back to time slices in the video.
  2. Instances of SemanticTag with tagName “Date”, “Title”, “Host” or “Producer”. These can be traced back to the part of the video where the information was obtained (that is, the location of the slate), but this is not needed here because it is not required by PBCore (or even allowed in the PBCore elements that we would be using).
  3. Instances of NamedEntity with category “Person”, “Location” or “Organization”. These do need to be traced back because we want to index on the locations in the video where a subject occurs.

Tracing back to the source location requires some processing because, while the information is available in the MMIF file, it is not explicitly stated in the NamedEntity annotation.

A note on collaboration. The CLAMS team could do one of the following:

  1. Provide code and an API that make it easy to get the needed information from a MMIF file.
  2. Create code that extracts the needed information from a MMIF file and outputs it in some kind of generic format.
  3. Create code that extracts the information and creates PBCore output.

Most of the work is in item 1, and that would be the minimal thing for CLAMS to do, but this document assumes for now that the CLAMS team also creates PBCore output.

Mappings from MMIF to PBCore

The PBCore to be created has a top-level pbcoreDescriptionDocument element:

<pbcoreDescriptionDocument xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">
</pbcoreDescriptionDocument>

Within this top-level element we may add the following sub elements: pbcoreAssetDate, pbcoreTitle, pbcoreContributor, pbcoreSubject, pbcoreAnnotation and pbcorePart. The examples below for the MMIF example file raw.json are based on the descriptions in http://pbcore.org/elements and feedback from Kevin.

To map the MMIF time frames we need an element that allows us to express the type and the start and end times. The only one I can see that is not obviously intended for other uses is pbcorePart. It does require a couple of sub elements that are not really relevant for us:

<pbcorePart startTime="0" endTime="2600" partType="bars-and-tone">
  <pbcoreIdentifier source=""/>
  <pbcoreTitle/>
  <pbcoreDescription/>
</pbcorePart>

<pbcorePart startTime="2700" endTime="5300" partType="slate">
  <pbcoreIdentifier source=""/>
  <pbcoreTitle/>
  <pbcoreDescription/>
</pbcorePart>
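
The two fragments above could be generated along the following lines. This is only a minimal sketch using Python's standard xml.etree.ElementTree module; the function name time_frame_to_part and the list of (frameType, start, end) tuples are assumptions made for illustration, and extracting the time frames from the MMIF file is taken as already done.

```python
import xml.etree.ElementTree as ET


def time_frame_to_part(frame_type, start, end):
    """Build a pbcorePart element for one time frame (times in milliseconds)."""
    part = ET.Element("pbcorePart", {
        "startTime": str(start),
        "endTime": str(end),
        "partType": frame_type,
    })
    # Required sub elements, left empty because MMIF has nothing to put in them.
    ET.SubElement(part, "pbcoreIdentifier", {"source": ""})
    ET.SubElement(part, "pbcoreTitle")
    ET.SubElement(part, "pbcoreDescription")
    return part


# Hypothetical input: time frames already pulled out of the MMIF file.
frames = [("bars-and-tone", 0, 2600), ("slate", 2700, 5300)]
parts = [time_frame_to_part(*frame) for frame in frames]
for part in parts:
    print(ET.tostring(part, encoding="unicode"))
```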

The semantic tags in MMIF have direct and unproblematic mappings to PBCore elements:

  Date → pbcoreAssetDate
  Title → pbcoreTitle
  Host → pbcoreContributor
  Producer → pbcoreContributor

<pbcoreAssetDate dateType="broadcast">1982-05-12</pbcoreAssetDate>
<pbcoreTitle>Loud Dogs</pbcoreTitle>
<pbcoreContributor>
   <contributor>Jim Lehrer</contributor>
   <contributorRole>Host</contributorRole>
</pbcoreContributor>
<pbcoreContributor>
   <contributor>Sara Just</contributor>
   <contributorRole>Producer</contributorRole>
</pbcoreContributor>
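
As an illustration of the mapping for semantic tags, here is a rough sketch in Python with xml.etree.ElementTree. The function name semantic_tag_to_element is hypothetical, and the dateType value "broadcast" is simply copied from the example above rather than derived from the MMIF file.

```python
import xml.etree.ElementTree as ET


def semantic_tag_to_element(tag_name, text):
    """Map one semantic tag (tagName plus its text) to a PBCore element."""
    if tag_name == "Date":
        # Assumption: every date found on the slate is a broadcast date.
        element = ET.Element("pbcoreAssetDate", {"dateType": "broadcast"})
        element.text = text
    elif tag_name == "Title":
        element = ET.Element("pbcoreTitle")
        element.text = text
    elif tag_name in ("Host", "Producer"):
        # Both map to pbcoreContributor; the tag name becomes the role.
        element = ET.Element("pbcoreContributor")
        ET.SubElement(element, "contributor").text = text
        ET.SubElement(element, "contributorRole").text = tag_name
    else:
        raise ValueError(f"no PBCore mapping for tagName {tag_name!r}")
    return element


host = semantic_tag_to_element("Host", "Jim Lehrer")
date = semantic_tag_to_element("Date", "1982-05-12")
print(ET.tostring(host, encoding="unicode"))
print(ET.tostring(date, encoding="unicode"))
```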

For the named entities we can use pbcoreSubject:

<pbcoreSubject subjectType="entity" annotation="Person" ref="SOME_REF"
               startTime="7255" endTime="8425">Jim Lehrer</pbcoreSubject>
<pbcoreSubject subjectType="entity" annotation="Organization" ref="SOME_REF"
               startTime="10999" endTime="11350">PBS</pbcoreSubject>
<pbcoreSubject subjectType="entity" annotation="Location" ref="SOME_REF"
               startTime="21000" endTime="21000">New York</pbcoreSubject>

I am not sure how best to use the attributes, so this is my best guess.

The subject type is “entity” for all of these, the annotation attribute is used to store the category of the named entity, and the ref attribute is used to refer to some external authoritative source.
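
A small sketch of how such a pbcoreSubject element might be built, again with Python's xml.etree.ElementTree. The function name entity_to_subject is made up for this example, and "SOME_REF" remains a placeholder because the external authoritative source is not settled here.

```python
import xml.etree.ElementTree as ET


def entity_to_subject(text, category, start, end, ref="SOME_REF"):
    """Build a pbcoreSubject element for one named entity (times in milliseconds)."""
    subject = ET.Element("pbcoreSubject", {
        "subjectType": "entity",   # the same for every named entity
        "annotation": category,    # Person, Location or Organization
        "ref": ref,                # placeholder for an external authority
        "startTime": str(start),
        "endTime": str(end),
    })
    subject.text = text
    return subject


subject = entity_to_subject("PBS", "Organization", 10999, 11350)
print(ET.tostring(subject, encoding="unicode"))
```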

Start and end times are in milliseconds. For the first two subjects, they are derived by finding the entity's tokens in the transcript text documents (by comparing start and end character offsets) and then tracing those tokens to the time frames that they are aligned with.
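
The offset comparison can be sketched as follows. This is a simplified illustration and not the MMIF API: the Token class, its field names and the character offsets and intermediate token times are all invented for the example, and it assumes the alignments between tokens and time frames have already been resolved into plain numbers.

```python
from dataclasses import dataclass


@dataclass
class Token:
    char_start: int   # character offsets in the transcript text document
    char_end: int
    time_start: int   # milliseconds, taken from the aligned time frame
    time_end: int


def entity_time_span(entity_start, entity_end, tokens):
    """Return (start, end) in ms for the tokens overlapping the entity span."""
    covered = [t for t in tokens
               if t.char_start < entity_end and t.char_end > entity_start]
    if not covered:
        return None
    return (min(t.time_start for t in covered),
            max(t.time_end for t in covered))


# Hypothetical tokens for "... is Jim Lehrer"; offsets and times are made up
# so that the result matches the "Jim Lehrer" subject shown above.
tokens = [
    Token(12, 14, 6900, 7100),   # "is"
    Token(15, 18, 7255, 7800),   # "Jim"
    Token(19, 25, 7900, 8425),   # "Lehrer"
]
span = entity_time_span(15, 25, tokens)
print(span)
```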

For the third, we know the named entity occurs in some text document (created by Tesseract) and we track that document to the bounding box generated by EAST that the document is aligned with. That bounding box has a timePoint attribute that is used for both start and end time. Note that if there had been a second text box for the “Dog in New York” text (that is, if the time the image was displayed on screen was a little bit longer) then that box would have its own time point and the end time for “New York” would have been 22000.
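
In code, this case reduces to taking the earliest and latest time point over all bounding boxes whose text contains the entity. The sketch below assumes those time points have already been collected from the MMIF file; the function name is hypothetical.

```python
def time_span_from_time_points(time_points):
    """Return (start, end) in ms from the timePoint values of the boxes."""
    if not time_points:
        raise ValueError("at least one time point is needed")
    return min(time_points), max(time_points)


# One box, as in the example file: start and end time are the same.
single = time_span_from_time_points([21000])
# The hypothetical case with a second box one second later.
double = time_span_from_time_points([21000, 22000])
print(single, double)
```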

Finally, here is all the above in one XML file, adding some identifier that we get from the input:

<pbcoreDescriptionDocument xmlns="http://www.pbcore.org/PBCore/PBCoreNamespace.html">

  <pbcoreAssetDate dateType="broadcast">1982-05-12</pbcoreAssetDate>

  <pbcoreIdentifier source="http://americanarchiveinventory.org">SOME_ID</pbcoreIdentifier>

  <pbcoreTitle>Loud Dogs</pbcoreTitle>

  <pbcoreSubject subjectType="entity" annotation="Person" ref="SOME_REF"
                 startTime="7255" endTime="8425">Jim Lehrer</pbcoreSubject>

  <pbcoreSubject subjectType="entity" annotation="Organization" ref="SOME_REF"
                 startTime="10999" endTime="11350">PBS</pbcoreSubject>

  <pbcoreSubject subjectType="entity" annotation="Location" ref="SOME_REF"
                 startTime="21000" endTime="21000">New York</pbcoreSubject>

  <pbcoreDescription/>

  <pbcoreContributor>
    <contributor>Jim Lehrer</contributor>
    <contributorRole>Host</contributorRole>
  </pbcoreContributor>

  <pbcoreContributor>
    <contributor>Sara Just</contributor>
    <contributorRole>Producer</contributorRole>
  </pbcoreContributor>

  <pbcorePart startTime="0" endTime="2600" partType="bars-and-tone">
    <pbcoreIdentifier source=""/>
    <pbcoreTitle/>
    <pbcoreDescription/>
  </pbcorePart>

  <pbcorePart startTime="2700" endTime="5300" partType="slate">
    <pbcoreIdentifier source=""/>
    <pbcoreTitle/>
    <pbcoreDescription/>
  </pbcorePart>

</pbcoreDescriptionDocument>