Ong Blog

A IIIF Concordance

As IIIF Manifests increasingly contain transcription annotations, the tools that are able to handle these documents must also treat text content as a first-class citizen. In our TPEN software, users regularly generate new annotations with text content that is included in the manifest documents made available for every project. In addition, TPEN allows for the easy inclusion of split-screen tools in the transcription interface, so I often spend some time experimenting with possible tools that assist in transcription, connect resources, or generate useful visualizations.

Welcome the Kaminski Handwriting Collection

Several months ago, OngCDH opened a conversation with the Kaminski Handwriting Collection, where David Kaminski has been working to capture samples of handwriting from all over, noting that American Paleography is not formally studied with significant scholarly focus. Kaminski has been customizing TPEN in support of American Paleography and contributing code back to our software development. Recently, an effort by Ethan Kaminski caught my eye.


This visualization loads a TPEN project, analyzes the text content, and creates a concordance of the words transcribed, along with a count of occurrences. For the purposes of paleography, this tool would help find and compare popular words and letterforms throughout a document or collection. As a split screen tool in TPEN, it would illuminate common phrases and serve as a simple search-style tool.
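The underlying transformation is simple enough to sketch. A minimal version (my own illustration, not the Kaminski code) tokenizes each annotation's text and tallies occurrences:

```javascript
// Build a word → count concordance from an array of transcription strings.
// The Unicode \p{L} class keeps accented letters intact; the tokenizer is
// deliberately naive, which medieval notation will eventually defeat.
function buildConcordance(lines) {
  const counts = new Map();
  for (const line of lines) {
    const words = line.toLowerCase().match(/\p{L}+/gu) ?? [];
    for (const word of words) {
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  return counts;
}

// Two lines of transcription: "erat" occurs twice.
buildConcordance(["In principio erat", "erat Verbum"]).get("erat"); // 2
```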

As an experiment, I first changed the “projectID” lookup to harvest any IIIF Manifest URI, so the tool would be useful for any published documents. This included support for fetching resources that aren’t included in the original document—a best practice for AnnotationList objects used in the Presentation 2.x API. Once the tool was able to do that, I set my sights on the other connected information available to the interface.
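That harvesting step might look something like the sketch below, which assumes a Presentation 2.x manifest and handles both referenced and embedded AnnotationLists; the function name is mine:

```javascript
// Collect annotation lists from a Presentation 2.x manifest, dereferencing
// any otherContent entries that are referenced by URI rather than embedded.
async function collectAnnotationLists(manifest) {
  const lists = [];
  for (const canvas of manifest.sequences?.[0]?.canvases ?? []) {
    for (const entry of canvas.otherContent ?? []) {
      if (typeof entry === "string") {
        // Referenced list: fetch it, as the spec recommends.
        lists.push(await fetch(entry).then(r => r.json()));
      } else if (entry.resources) {
        lists.push(entry); // Embedded list: use it directly.
      } else if (entry["@id"]) {
        // Stub with only an @id: still needs resolving.
        lists.push(await fetch(entry["@id"]).then(r => r.json()));
      }
    }
  }
  return lists;
}
```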

Filtering the collection by the length of the word and its frequency seemed like an obvious place to start. Combined with standard ascending/descending sorting, a user could jump past stop words or focus on interesting cases without any configuration.
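A sketch of that filtering, with parameter names that are illustrative rather than the tool's actual API:

```javascript
// Filter and sort concordance entries ([word, count] pairs) by word length
// and frequency, with ascending/descending ordering on the count.
function filterEntries(entries, { minLength = 1, minCount = 1, descending = true } = {}) {
  return entries
    .filter(([word, count]) => word.length >= minLength && count >= minCount)
    .sort(([, a], [, b]) => (descending ? b - a : a - b));
}

// Jumping past short stop words without any configuration file:
filterEntries([["in", 40], ["principio", 3], ["et", 55]], { minLength: 4 });
// → [["principio", 3]]
```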

Adding links that scroll the main page of entries to the selected word, along with an occurrence count on each entry, made the tool more useful. Capitalizing on each annotation’s target to show image fragments, or to link to a specific line within applications like TPEN or Mirador, will have to wait for some less experimental time. Honestly, without specific use cases, I don’t have much energy for anything beyond integration into TPEN, but I am grateful to the Kaminskis for contributing momentum to both TPEN software development and the study of American paleography.

Challenges to overcome

There are standards that guide the creation and distribution of complex objects, like collections of books’ and manuscripts’ images; encoded transcriptions, translations, and commentary; and the annotations designed to link them together. Even if the paleography community across domains had established common conventions for combining these standards, covering cases even as narrow as transcriptions and translations introduces all sorts of inconsistencies. (Bryan Haberberger has covered many of these challenges in another post.)

Supporting only TPEN manifests means IIIF Presentation 2.1.1 and expecting text content to always be in manifest.sequences[0].canvases[n].otherContent[0].resources[n].resource["cnt:chars"], which makes a lot of assumptions. In a transcribed Stanford manuscript, the AnnotationList must be resolved first, and then the characters are in resource.chars instead. In other manifests, the text is a linked TEI-XML resource covering the entire document, or a transcription service supplies page-at-a-time text, as FromThePage does. All this is just to find the text content, before dealing with the fact that the text itself may contain partial XML markup and uses character sets, notation conventions, and languages that escape the most conveniently available regex snippets for tokenizing modern texts.
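A fallback chain for locating the text, covering only the shapes just described, might look like this sketch (real producers vary even more than this):

```javascript
// Pull the text content out of a single annotation, trying the common
// shapes in order. Only cnt:chars and chars appear in the examples above;
// the TextualBody `value` case is my own addition for Web Annotation data.
function textOf(annotation) {
  const res = annotation.resource ?? annotation.body ?? {};
  return (
    res["cnt:chars"] ?? // TPEN-style embedded text
    res.chars ??        // e.g. a resolved Stanford AnnotationList
    res.value ??        // Web Annotation TextualBody
    ""                  // linked TEI-XML etc. would need a fetch instead
  );
}

textOf({ resource: { chars: "In principio" } }); // "In principio"
```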

Bibliothèque nationale de France, NAF 6221, provided by the Stanford University Libraries, illustrates how nearly equivalent terms are not normalized and how notation conventions can complicate the analysis.

New Opportunities

There is absolutely no shortage of tools for word frequency and concordances. However, these tools all expect a level of preparation or access that is far from universal. That works well for focused projects, but few people have datasets they can use for a quick look at an interesting visualization. On the other hand, IIIF Manifests are proliferating and, for all their capriciousness, tend towards conformance. Moreover, the data within Web Annotations on Canvases includes image fragments and other bits of information that are truly interesting to researchers and their robots. Considered briefly, I would categorize the next possible steps thus:

  1. Integration—the use of this tool within TPEN is simple with the split screen iframe configuration. The standard Capelli tool (git) is a standalone application that just works. For the specific purposes of TPEN, integration could be improved to link back to specific project pages, for example. Other forks may also be established for connections within other web applications.
  2. Visualization—in the digital humanities, the most boring datasets are often gilded with impressive word clouds, dynamic graphs that jostle under your cursor, and science-y distributions that turn conclusions into PowerPoint kibble. While I would resist piling these on without a use case, web applications like Voyant Tools have proven that flashy still attracts a lot of eyes (and it is customizable). In this same category, one would place UX/UI improvements, clarifying the analysis, providing better filtering and searching, etc.
  3. Investigation—Each line of results has an annotation associated with it that links to image fragments, Canvases with metadata, and Manifest documents. A summary page of “orgueil” may show all 12 instances in the original texts for comparison. A “bookbag” may allow users to hold onto a few lines to compare against another text. Most importantly, as a standard document format, a user could follow a link out of the application to transcribe this document in TPEN, view the manuscript in Mirador, or export the results in a format expected by a more comprehensive, albeit picky, analysis application on or offline.

This is happening


There is a 99% chance that this tool is handy enough that I will throw a few more days against it to integrate it with TPEN. A new post here will announce its inclusion in the default tool set, and users will be oblivious to it. After a while, I may combine several public transcriptions together and release a supercut output of existing transcriptions. At that point, we’ll pretend it is brand new, present it in a community call or at a conference, and amaze scholars who will now clamor to sign up.

Of course, it would be even nicer if there were specific use cases beyond our tool that could help establish a motivating roadmap for development. The code for this is in fact so simple that many may be moved to contribute in a small way. Please drop an issue explaining your use case or feature request; appreciate that the code has no dependencies and requires no building or compiling, and contribute; try out a sample; or add your own manifest parameter and see if your favorite document chooches, and Pinstatweet it!

Exploiting Client-side Line Detection

This continues a previous post that introduces the minor piece of code we’ll be exploring below.

Hello, Old Friend

Recently, two events coincided that inspired me to pull this code back out and take a second look at the process. The first was that our center charged a group of Computer Science majors with improving the effectiveness of our image processing algorithm as part of their senior capstone project; the second was the seemingly sudden burst of HTR promises, which depend on some level of layout analysis to work. In both cases, I was struck that improvements were accomplished with more processing power and additional layers of analysis. Although more of the outlier cases were falling into scope and well-formed documents were becoming more automatable, the simple cases were moving from realtime (<8 seconds) into delays of minutes or even hours before interaction with the results became possible. I do not want to diminish the scale of these accomplishments or sound like someone who gripes that I must wait almost two hours to board an airplane that will take me to the other side of the world in half a day. However, there are certainly use cases at the lower end of the complexity spectrum that may not require and cannot benefit from the horsepower being built into these new models.

I honestly don’t know where this sample image came from (aside from the British Library), but it was in my cache when I lost WiFi years ago. It was time to feed this to the machine and see what happened. In short order, I wrote up a function to visualize the sums of the rows and columns to see if the text box seemed to be obvious. The result felt conclusive:

Setting a default threshold of 10% of the busiest row (marked in black beneath the image), the possible columns popped out as expected. I was also pleased to see that candidate rows appear without too much imagination. Obviously, there are some spots, such as the gutter and the page edges that do not represent a text area, but by simply constraining the width of the analysis and expecting the sawtooth of rows, I not only eliminated irrelevant “columns” but was able to detect separation within a column. I can easily imagine bracket glosses, round text paths, or heavily decorated text that would break this, but those are not my target. With no optimization and the inclusion of several heavy visualizations, I was able to render decent candidate annotations for column and line detection in about two seconds. At the lowest resolution, this time was under one-fifth of a second.
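The projection step described above can be sketched as a scan over row sums against that 10% threshold. This is my own illustration; `grid` stands in for the per-pixel busyness values:

```javascript
// Given a 2-D "busyness" grid (rows of numbers), find candidate text rows:
// any run of rows whose sum meets a threshold fraction of the busiest row.
// The 0.1 default mirrors the 10% threshold described in the text.
function candidateRows(grid, threshold = 0.1) {
  const sums = grid.map(row => row.reduce((a, b) => a + b, 0));
  const cutoff = Math.max(...sums) * threshold;
  const rows = [];
  let start = -1;
  sums.forEach((sum, i) => {
    if (sum >= cutoff && start < 0) start = i;                    // run begins
    if (sum < cutoff && start >= 0) { rows.push([start, i - 1]); start = -1; }
  });
  if (start >= 0) rows.push([start, sums.length - 1]);            // run reaches the edge
  return rows; // [firstIndex, lastIndex] pairs
}
```

Running the same function over the transposed grid yields candidate columns, which is why constraining the analysis width cheaply refines the rows.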

Things take a turn

Rather than declare victory, I investigated the minor errors that I was able to find. While I expected and accepted pulled-out capitals and paragraph numbers, as well as the mischaracterization of a header graphic as text, it bothered me that one pair of lines was joined, though the visualization suggested their separation. I could fiddle with the thresholds to get a better result, but that also thinned the other lines beyond what made sense to me, so it was not a solution. Stepping through the numbers, it seemed that the slight rotation magnified the impact of the ascenders, descenders, and diacritics that busied up the interlinear spaces. It would not be unreasonable for this lightweight tool to require pre-processed images with good alignment, but some simple poking told me this image was “off” by just around -0.75 degrees, which feels close enough for most humans to consider it a good photo. Instead, I began to imagine the shadow cast by a rotating text box and experimented with rotations that made the column curve more round or severe.

They were mathematically different, but determining the best fit was becoming more and more complex, which undermined the original purpose. A simple check of the rotation that produced the narrowest column was possible, and seemed to always be true for the best rotation, but automating that step was difficult on multiple columns and it was too easy to miss the best rotation if the interval was set too high. I looked at column widths, row counts, the difference between max and min values for a host of metrics, but nothing reliably predicted the correct rotation.

Always Assume

After carefully recording and comparing characteristics of good and bad fits across several images, I discovered an assumption about manuscripts that I was not yet leveraging: rows are regular. Even with variety, most ruled manuscripts will be dominated by rows of similar heights. I updated the function to select the best rotation based on the minimum standard deviation of row height. This calculation is lightweight for the browser, and the rows are already calculated at each step of determining column boundaries, so there was minimal overhead. As a default, I evaluate each degree from -3 to 3 and then rerun around the lowest value with half the interval until the interval is under one-eighth of a degree. Without caching checks or eliminating intermediate renders, this process takes longer, but it regularly finds the best rotation for a variety of documents. On my machine, it takes about 1 millisecond per pixel processed (40 seconds with the sample image), but the back of my envelope records 922 of these tests as redundant, which means a simple caching optimization will put this process under twenty seconds. Using this method, an incredibly problematic folio (microfilm, distorted page, skewed photography, tight lines) is not only rotated well, but is evaluated with incredible precision.
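The coarse-to-fine search can be sketched as below. `rowHeightsAt` stands in for the real measurement of detected row heights at a given rotation, and the schedule mirrors the defaults just described (whole degrees from -3 to 3, then halving the interval around the winner until it drops under one-eighth of a degree):

```javascript
// Find the rotation whose detected rows have the most uniform heights,
// i.e. the minimum standard deviation of row height.
function bestRotation(rowHeightsAt) {
  const stddev = hs => {
    const mean = hs.reduce((a, b) => a + b, 0) / hs.length;
    return Math.sqrt(hs.reduce((s, h) => s + (h - mean) ** 2, 0) / hs.length);
  };
  let best = 0, bestScore = Infinity;
  let center = 0, interval = 1, span = 3;
  while (interval >= 1 / 8) {
    for (let a = center - span; a <= center + span + 1e-9; a += interval) {
      const score = stddev(rowHeightsAt(a));
      if (score < bestScore) { bestScore = score; best = a; }
    }
    center = best;     // rerun around the winner...
    interval /= 2;     // ...with half the interval...
    span = interval * 2; // ...over a tighter range
  }
  return best;
}
```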

Robert Grosseteste, Eton College Library 8

Full page, rotated 1.375 degrees, 52 rows in 2 columns

Next Steps

This is not remarkable because it is possible, but because it is mathematically simple and reasonable to accomplish on a client. This not only means the transcription (or generic annotation) application does not need to sideload the image to process it, but also that any image can be offered, even one off the local machine or that doesn’t use helpful standards like IIIF. One can imagine this analysis may be available for any image within the browser through a bookmarklet or extension. Once analyzed, these annotations could be sent to a service like Rerum, saved into LocalStorage for later recall, or sent directly into a transcription tool like TPEN.

Within an application, this tool can be even more powerfully used. Without requiring a complex API to save settings, a user may tweak the parameters to serve their specific document and reuse those settings on each page as the interface renders it. Even if the line detection is inaccurate or unused, the column identification may be helpful to close crop an image for translation, close study, or to set a default viewbox for an undescribed page.
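Persisting those tweaked parameters could be as light as a pair of localStorage helpers. The key scheme and parameter names below are invented for illustration, and the storage object is injectable so the sketch is not browser-only:

```javascript
// Persist per-document detection settings so they can be reused on every
// page of the same manuscript. Defaults to the browser's localStorage.
function saveSettings(docId, settings, storage = globalThis.localStorage) {
  storage.setItem(`lineDetect:${docId}`, JSON.stringify(settings));
}

function loadSettings(docId, storage = globalThis.localStorage) {
  const raw = storage.getItem(`lineDetect:${docId}`);
  // Fall back to hypothetical defaults when nothing has been saved yet.
  return raw ? JSON.parse(raw) : { threshold: 0.1, maxRotation: 3 };
}
```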

This is not part of any active project and just represents a couple of days spent flogging an old idea. The whole thing, such as it is, has a github repository, but isn’t going to see action until there is a relevant work case. What do you think? Is this worth a few more cycles? How would you use a tool like this, if you could do lightweight image analysis just in your browser or offline?

Experimenting with Client-side Line Detection

Does not compute

Using an “old” iPad on a plane to review transcription data was a clarifying task. For all the advances in research technologies, even simple tasks, such as viewing manuscript images on an institution’s website, can crash a five-year-old browser, effectively rendering those images inaccessible. I am not willing to accept that the very tools and scripts we have been building to make these resources more interactive and discoverable are also rendering them inaccessible on aging (but still functioning) hardware. There is a place for discussing progressive enhancement design, progressive web applications, and emerging mesh-style protocols like IPFS, but I’m going to be very targeted in this post. The choke point of manuscript image analysis has always been the server-side task of layout analysis (as in our TPEN application), which has been making great advances with the addition of machine learning in computing clusters (Transkribus and others are in the spotlight at the moment). I am calling for an algorithm simple enough to run in the browser of an underpowered machine that can accomplish some simple tasks on “decent” photography.

WiFi not available

Imagine you are in a magic tin can that zips through the air at high speeds and connects you simultaneously to all the world’s knowledge. From these heights you work away, paging through images of a medieval manuscript and transcribing it into a digital language that encodes it for limitless reuse. You are working in this particular tool not because it is the best at image analysis, but because its servers run an algorithm good enough for your clean document and do so in real time, returning line detection on each page in mere seconds—at least it used to. As the Internet connection gets spottier, the responses become slower. You wait eight seconds… thirty seconds… and then silence. Your mind reels trying to recall that YouTube video you watched on EM waves and to resist blaming this outage on a vengeful god. A full minute without WiFi passes and you realize there is a chemical bomb on your lap that cannot even entertain you. It would have been more reliable to carry a pencil and a sheet from a mimeograph with you than this unusual pile of heavy metals, polymers, and pressure-cooked sand. What else about your life have you failed to question? Do you even really grasp the difference between air speed and ground speed? How planes!?

Dash blocks and breathe

I was unable to answer all these questions for myself, but I did start to wonder about what minimum effective image analysis might look like. Existing algorithms with which I was familiar used very generic assumptions when looking for lines. The truth is that manuscripts can be quite diverse in form, but photographs of them taken for transcription strongly tend towards some similarities. For this experiment, I am dealing with manuscripts where the text is laid out in rectangular blocks and takes up at least a quarter of the image. I wanted to find something that could deal with the dark mattes, color bars, rulers, and other calibration paraphernalia. Ideally, it would be able to find text boxes and the lines within, even if the original image was slightly askew or distorted. Algorithms that looked only for dark areas were confused by mattes and often rated a red block as equivalent to a column of text. Strictly thresholding algorithms lost faded tan scripts on parchment easily. My solution would need to be good enough to run in a vanilla state and quick enough to calibrate for special cases if needed.

I did not look for dark spots, but for “busyness” in the page. While some scripts may have regions of strong linear consistency, most scripts (even character-based ones) are legible by their contrast to the plainness of the support medium.

Sample image processed for “busyness”

I began, on that airplane ride, to write a simple fork of some canvas element JavaScript filters I had bookmarked a long time ago. Simply, I redrew the image in the browser as a representation of its busyness. What I dropped on Plunker when I landed took each pixel and rewrote it depending on the difference between itself and the adjacent pixels on the row. I was excited that with three very different samples, the resulting visualization clearly identified the text block and reduced the debris. By then the plane had landed and I put away my childish fears that technology would ever abandon me.
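The filter itself amounts to very little code. Here is a sketch of the same idea on a flat grayscale array; the version on Plunker works on the RGBA ImageData a canvas element actually provides:

```javascript
// Rewrite each pixel as the absolute difference from its right-hand
// neighbour on the same row, so "busy" regions (text) become bright and
// flat regions (parchment, mattes) go dark.
function busyness(pixels, width) {
  const out = new Uint8ClampedArray(pixels.length);
  for (let i = 0; i < pixels.length; i++) {
    const atRowEnd = (i + 1) % width === 0;
    out[i] = atRowEnd ? 0 : Math.abs(pixels[i] - pixels[i + 1]);
  }
  return out;
}
```

Because the comparison is relative, a faded tan script on parchment still registers, while a uniformly dark matte or red calibration block does not.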

Finding Value

In the next post, I will discuss why I opened up this old pile of code again to see if I could teach it a few new tricks. I am curious, though: what snippets or small concepts do you have in a dusty digital drawer that might be useful? Use the comments here to advertise the github repo you haven’t contributed to in years, but still haven’t deleted.

Authentication and Attribution in RERUM

Any new web service or application must take a considered look at authorization, authentication, and attribution—authorization, to make changes to data; authentication, to ensure those making changes are known; and attribution, to apply proper credit for contributions. The prevailing practice is to authenticate users within applications and use the application context to make attributions. Popular transcription software, like TPEN and FromThePage, relies on user accounts and a private database to authenticate, attaching attribution based on the user’s account information in the interface and whenever the data is exported, for example, as a IIIF manifest document. Our goal to make RERUM a potent supplement to the heavier data APIs these types of interfaces rely on forced us to reevaluate the “obvious” choice to create and authenticate users.

Data Services and the User Problem

For Example

Annotation services like Genius follow the user authentication model and work well as plug-ins to other applications, but are never truly integrated. They are much more standards-based and reliable than Facebook comments or modules like Disqus, but still require a user account to authenticate. The problem with user accounts is that they expire even faster than users themselves. As an example, let’s consider how analog records are made more durable. At the Wellcome Library, there is a document representing a book published in 1862. The artifact itself claims J.C. Clendon, M.R.C.S. is the author, but it does not lean only on that assertion. Immediately following the authorship statement is a qualifier connecting Clendon to Westminster Hospital and the Hospital School. As further authentication, the title page also offers the Greenwich Medical Society as witnesses to its reading. Finally, to lend trust to the assertion of the credentials presented, a publisher is also presented.

Analog attribution and authentication


In the arc of the history of scholarship, it may be difficult to check the specific credentials of Clendon, or even to establish his uniqueness. However, the historical footprint and trustworthiness of the affiliated institutions may cast a longer shadow. Certainly, the combination of elements creates a useful attribution. As a data service, then, RERUM requires only an application to authenticate itself, allowing it then to assert whatever it may about its own content. In this way, the __rerum property is a trustworthy record of the application asserting the content. If the transcription annotation asserting authorship were saved by Wellcome into RERUM, it would be sent in as:

    {
        @id: "",
        @type: "oa:Annotation",
        motivation: "sc:painting",
        resource: {
            @type: "cnt:ContentAsText",
            format: "text/plain",
            chars: "J. C. CLENDON, M.R.C.S.,"
        },
        on: ",1641,931,65"
    }

As an external resource (details), this update action will create a derivative of the (coincidentally non-dereferenceable) annotation and assign a new URI:

    {
        @id: "",
        @type: "oa:Annotation",
        motivation: "sc:painting",
        resource: {
            @type: "cnt:ContentAsText",
            format: "text/plain",
            chars: "J. C. CLENDON, M.R.C.S.,"
        },
        on: ",1641,931,65",
        __rerum: {
            // URI to foaf:Agent for Wellcome
            generator: "",
            ...
        }
    }

Among other properties in __rerum, the generator property points to the Agent that was created for the application when it obtained its API key. Now there is a dereferenceable location available, and anyone consuming the annotation can trust RERUM to authenticate the Wellcome Library as the generator and the content as coming from them. If any other application makes additional modifications, those versions also identify their generator, allowing consuming interfaces to branch or filter as needed.
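A consumer filtering or branching on the generator needs nothing more than this sketch (the agent URIs here are hypothetical):

```javascript
// Keep only the versions generated by a particular authenticated Agent,
// using the generator recorded in each version's __rerum block.
function byGenerator(versions, agentUri) {
  return versions.filter(v => v.__rerum?.generator === agentUri);
}

byGenerator(
  [{ __rerum: { generator: "https://example.org/agent/wellcome" } },
   { __rerum: { generator: "https://example.org/agent/tpen" } }],
  "https://example.org/agent/wellcome"
).length; // 1
```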


In this way, no user is ever authenticated by RERUM, and attribution is pushed only as far as is possible with certainty and reasonable durability. In this case, Wellcome makes no claim within the annotation itself as to its attribution, and even looking up the manifest only offers the backing of the Library (which may be enough for many cases). Another example is from TPEN: this line of transcription looks very similar to the Wellcome example, but includes _tpen_creator, an undefined but consistent property. Someone familiar with TPEN’s API could use the “637” value with the /geti?uid= service to discover public user information. The flexibility of the authentication system in RERUM allows, of course, for the well-attributed document that takes full advantage of RDF and the Linked Data graph to connect to meaningful and dereferenceable IRIs, but it also degrades without breaking as the external links become harder to resolve or understand.

Deleted Objects in RERUM

In the last post, we explored how the tree of the version history is healed around a deleted object. In this post, we look more directly at the transformations to the deleted object itself. Let’s take the same abbreviated object to begin:

{ "@id"     : "",
  "body"    : "content",
  "__rerum" : {
       "generator" : "Application",
       "isReleased": "",
       "history"   : {
           "prime"    : "root",
           "previous" : "",
           "next"     : [ "" ]
       }
  }
}

The Case for Breadcrumbs

Because we are removing it from the versioning, the original tree structure and its position are irrelevant. However, it must be obvious to consumers that something has changed and that their version of the object is no longer intended to be valid, while offering a way out. Two often-used solutions that fail here are removal and flagging. If the object were no longer available at its original @id, interfaces (for example, a reference to a transcription annotation in an online article) would simply break, and developers would have no way of recovering the data or seeking an alternative. If a property like deleted:true were added, it would change the object so little that the interface would continue to render out-of-date information unless the developer had the foresight to specifically check for this flag.

To break the response without stranding the client, we need to disrupt the expectations of the interface without discarding the object data. Our solution is to move the entire object down one level into a __deleted property, breaking unaware interfaces, but still allowing developers to access the original object. Consider the deleted version of our sample:

{ "@id"       : "", //same ID
  "__deleted" : {
       "time"   : "1515183834857", //time of deletion
       "object" : {                //snapshot of object at deletion
           "@id" : "",
           "body" : "content",
           "__rerum" : {
               "generator" : "Application",
               "isReleased": "",
               "history" : {
                   "prime" : "root",
                   "previous" : "",
                   "next" : [ "" ]
               }
           }
       }
  }
}

In a typical interface from a consuming application, the rendered display would resolve the @id and bind to response.body. This is now undefined, and an aware interface (or a vocal audience) will quickly alert the developer of the change. Even a human who was ignorant of the RERUM API should be able to grok that the “new location” of their data is __deleted.object.body and the real-life meaning of that move should feel like plain English. At this point, the developer can choose to update the display to render the deleted object (hopefully with some notice) or replace it with an adjacent version referred to in the history property.
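A defensive consumer might resolve either shape like so; this is a sketch against the sample structure above, not code from the service:

```javascript
// Detect a deleted RERUM object and recover the snapshot of its body,
// flagging the result so the interface can display an appropriate notice.
function resolveBody(obj) {
  if (obj.__deleted) {
    // Deleted: the original object lives one level down, beside a timestamp.
    return { deleted: true, body: obj.__deleted.object.body };
  }
  return { deleted: false, body: obj.body };
}

resolveBody({ "@id": "x", __deleted: { time: "1515183834857", object: { body: "content" } } });
// → { deleted: true, body: "content" }
```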


In this system, there is no way, through the API, to actually obliterate an object or any version of it. This is important because of the interrelationships between all the versions and the risk built into true deletions being committed pell-mell. The MongoDB behind our current version of RERUM still supports complete removal, if it becomes required, but we believe the presented solution will break interfaces appropriately without marooning them.

So what do you think? Does our solution stray too far from convention? Are we inviting disaster trying to enforce standards while messing with objects? Let us know in the comments or submit an issue on Github.

Forgetting Deleted Objects in RERUM

At the Walter J. Ong, S.J. Center for Digital Humanities, we have been working hard on RERUM, the public object repository for IIIF, Web Annotation, and other JSON documents. The latest feature we’ve been diving into for the 1.0 release is DELETE. As is covered in the documentation on Github, there are a few guiding principles that are relevant here:

  1. As RESTful as is reasonable—accept and respond to a broad range of requests without losing the map;
  2. As compliant as is practical—take advantage of standards and harmonize conflicts;
  3. Trust the application, not the user—avoid multiple login and authentication requirements and honor open data attributions;
  4. Open and Free—expose all contributions immediately without charge to write or read;
  5. Attributed and Versioned—always include asserted ownership and transaction metadata so consumers can evaluate trustworthiness and relevance.


Setting aside for now some of the nuance of the common REST practices and the Web Annotation Standard, we needed to determine what type of request we would accept, what transformation would happen to the data, and how the service would respond to the client application. Authentication (covered in depth in another post) is the first concern, since an Open and Free repository is a target for vandalism. Honestly communicating the “deleted” status of an object without breaking all references to it is served by appropriate Version Control (see here). Combining these concerns, we realized it was not enough to “just remove” or “delete-flag” an object (specifically, a version of an object)—our repository has to heal the version history tree around it.

Let’s start with an abbreviated object (prime shown):

{ "@id"     : "",
  "body"    : "content",
  "__rerum" : {
      "generator" : "Application",
      "isReleased": "",
      "history"   : {
          "prime"    : "root",
          "previous" : "",
          "next"     : [ "" ]
      }
  }
}

Its version history tree (in the simplest complex form) looks like this:

A single object in 9 versions, updated by 3 different applications (A, B, C)

Before attempting any deletion, there are 3 checks that must be passed:

  1. Is the request coming from the original generator?
    In an open system, we offer endless opportunities to extend the work of others, but deleting is destructive, so it may only be done by the application that originally created this version.
  2. Is this version Released?
    RERUM offers few promises to client applications, but the immutability of Released objects is one of them.
  3. Is this version already deleted?
    While it may not have a huge effect to “re-delete” a version of an object, it feels dishonest to the client to do so.


There are three possible positions a version may be in (in ascending complexity):

  1. Leaf, without descendants: 05, 08, 09 above;
  2. Internal, with a parent and at least one child, 02 and 07 being the most complex; and
  3. Root, as the __rerum.history.prime value of “root” indicates in node 01.

All these options follow this process:

In the first two cases, where the deleted version is not “root”, the healing is fairly straightforward. The chain is simply pinched around the version, removing the object: the versions listed in its history.next array are updated (in their history.previous) to point to the version from which the deleted version was derived, and the history.next array in that parent version is updated to bypass the deleted object. In the case of deleting 02, above, 03 and 06 would now claim 01 as a parent, and 01 would now have them as children.
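That pinch can be sketched directly against objects shaped like the samples above; this is an illustration of the non-root case, not the service’s code:

```javascript
// Heal the history chain around a deleted version: its children adopt its
// parent, and the parent's next array bypasses it. `versions` maps id →
// object with the __rerum.history structure shown in the samples.
function healAround(versions, deletedId) {
  const del = versions[deletedId].__rerum.history;
  const parent = versions[del.previous];
  for (const childId of del.next) {
    versions[childId].__rerum.history.previous = del.previous; // adopt parent
  }
  parent.__rerum.history.next = parent.__rerum.history.next
    .filter(id => id !== deletedId) // drop the deleted version...
    .concat(del.next);              // ...and inherit its children
}
```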

If, however, the deleted object is the original instance, the troubling reality is that every version refers to it as the history.prime value. Moreover, by deleting it, the client has made an adjustment to reality, removing the timestamped “first moment” of the history. Trusting the client application and holding to our principles of freedom, we must allow this and document it well. For each child of 01, the original tree is broken into a new tree with that node as its root. This feels destructive, but reflects the intent of a client requesting such an action. Within that new tree (02, here), history.prime becomes “root” and the @id of the new root must be populated to all the descendants. The new prime also enters an unusual state (explored more here) where history.prime is “root” and there is a value for history.previous. By referring to the deleted object, it remains possible for an enterprising researcher to reconstruct the historical relationships in this tree, though no query will expose it simply.


What of the case where I want to obliterate an entire object, including all versions ever generated? Though a simple case from the perspective of user intention, it becomes logically complex. We have not supplied a shortcut method for this yet, but the logic of the process follows the chart above in iteration. Let’s take the applications A, B, and C above and attempt to blow away this unwanted object.

Application A

This application seems to have the most right to commit this act, as the owner of the “prime” instance. However, there are intervening conditions here. No matter whether you burn the tree from leaves to root or root to leaves, the result is the same. Removing 01 simply updates the tree to have 02 as prime, and all nodes are updated. Removing 02 does the same, with the side effect of creating a tree from 03 (A) and 06 (B). Continuing to remove the 03 tree is easy, but an attempt to remove the 06 tree fails immediately, as “A” no longer “owns” it. This is appropriate, since “A” was allowed to purge the history of its contributions, but not those derived from them in the meanwhile. Clients using 08 and 09 will now see only the tree rooted at 06, though they may traverse the records back to the original prime if they wish.

Application B

This application has the reverse problem from “A” as a derivative creator. No attempt to delete anything above 06 will be possible, full stop. Blowing away the 06 branch, however, will meet with similar constraints as “A,” since “C” has established a derivative branch on 07. As 07 is deleted, the versions in its history.next array will be updated to point to 02 (the result of deleting 06). Once all allowed actions are finished, 09 will have effectively replaced 06 in the 01 tree. No active node within the tree will retain a reference to “B” versions, though (see below) the deleted objects will report their placement.

Application C

This application is the most limited, as it has only contributed one version, a leaf, to the tree. While it is unable to impact the major structures, it is also able (like “B”) to remove itself from the history completely, leaving the only traces of its original participation in the objects themselves.

The Debris

What is left after a successful deletion in the history tree is important to clients who continue to depend on the leaves, but what of those consumers who had referenced the deleted object? Following our principles, we intend to make it obvious to consumers that something has changed and that their version of the object is no longer intended to be valid, while offering a way out.

We think our solution is quite clever, but that’s a post for another day.

Versioning in RERUM

Versioning as it is known in software is simply the process of preserving previous iterations of a document when changes are made. There are many systems available to the developer which differ in centralization, cloning behaviors, delta encoding, etc., but for our purposes, the philosophy and utility should suffice.

From a mile up, versioning resembles different editions of a published work. Even when the new work supersedes the previous, both are maintained so that they can be cited reliably and the changes between them may become an interesting piece of the academic conversation as well. In the digital world, new versions can come quickly and the differences between them are often quite small. However, in an ecosystem like Linked Open Data, that presumes most resources are remote and have a permanent and reliable reference, even small gaps can magnify and create dysfunction.

Permanent Reference

The idea of a PURL for resources on the Internet is an old one. Most HTML pages for records in a collection include a “reference URL” or “cite as” which indicates to the user how to reliably return to the object being viewed. With RERUM, we want to extend the courtesy of a stable home to sc:Manifest, oa:Annotation, and similar objects that are more often viewed only as attached to resources like books, paintings, and other records of “real things.”

Upon creation in RERUM, an object (henceforth an annotation, for ease of example) is minted a novel URI (@id, in JSON-LD) where it will always be available. Whenever a trusted application makes a change, a new version is saved and given a new URI, connected to the previous annotation. In most cases, the significant changes will be to the oa:hasBody or oa:hasTarget fields, but any altered property may make an existing reference unreliable, so the new URI is returned to the application while the old reference remains stable.
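As an illustration with placeholder URIs, an updated annotation might carry a new @id while its history still records the stable reference to the version it replaced:

```json
{
  "@id": "https://example.org/v1/id/bb22",
  "@type": "oa:Annotation",
  "oa:hasBody": "a corrected transcription reading",
  "__rerum": {
    "history": {
      "prime": "https://example.org/v1/id/aa11",
      "previous": "https://example.org/v1/id/aa11",
      "next": []
    }
  }
}
```

The original at aa11 remains resolvable, so citations of it do not break.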

Branches: the family tree of digital objects

In the wide world, it is likely that two or more applications may be referencing and even updating the same annotation for different reasons. As an example, let’s consider transcription annotations made in a standards-compliant application, such as TPEN, and Mirador, an open viewer for IIIF objects. A museum may digitize a manuscript and open it up for public transcription, creating many varied annotations throughout the document, identifying image fragments and text content. In RERUM, these annotations are not only associated with each other in the Manifest shared by the museum, but every individual annotation attributes itself to the application (if not the user) that created it or offered updates. This case for versioning is simple to see and accommodate.

Now a prestigious paleographer comes along wishing to complete an authoritative edition of this manuscript. She will begin a project in the transcription application using all of the annotations that existed at the time. As she goes through and applies corrections and conventions for expansions, spelling, and punctuation, she is updating the annotations, but in her own project, not the public project from the museum. This works because the museum project’s annotations still exist and the references to them have not changed. If someone else suggests a change to an annotation for which the paleographer has already created a more recent version, the history of the annotation branches, allowing both current versions to share ancestry without collision.

Timeline Breakdown

1. Created four lines on a page in the museum project:

{A1:Museum1}, {A2:Museum1}, {A3:Museum2}, {A4:Museum1}

2. The paleographer begins work and makes a change on the second line:

{A1:Museum1}, {A2.1:Paleographer1}, {A3:Museum2}, {A4:Museum1}

but the museum project remains unchanged.

3. When Museum2 corrects an error in Museum1’s transcription (coincidentally the same error the paleographer caught):

{A1:Museum1}, {A2.2:Museum2}, {A3:Museum2}, {A4:Museum1}

the public project is updated without breaking the paleographer’s edition.

So the history of A2 looks like this:
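A figure originally illustrated this branch; it can be sketched (with the shorthand identifiers from the timeline above) as:

```json
{
  "A2": {
    "attributedTo": "Museum1",
    "next": [
      { "A2.1": { "attributedTo": "Paleographer1", "next": [] } },
      { "A2.2": { "attributedTo": "Museum2", "next": [] } }
    ]
  }
}
```

Both A2.1 and A2.2 are current leaves sharing A2 as their common ancestor.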

Now the model works for everyone. The museum has an open project, the paleographer has a project she controls, and every contribution throughout the history is preserved automatically. The problem that has been introduced, however, is how a Manifest viewer should display these annotations. To show them all in all their versions would quickly overwhelm, but depending on the moment, the “most recent” annotation may be from either branch. Moreover, the paleographer may not intend for these annotations, though stored in an open repository, to be considered a finished project, suitable for display or citation. The simplest answer is that a repository can publish its Manifest with all the intended annotations included, but that undermines the possibilities of an open store.

Digital “Micropublishing”

RERUM supports and encourages the use of a few oa:Motivations that have not been used before. In addition to the Motivation on each well-formed annotation that should indicate whether it is a transcription, translation, comment, edit, etc., applications may apply rr:working or rr:releasing. Having neither of these Motivations means nothing—it is simply unhelpful, and is the default of current encoding platforms. The thinking up to this point has been that annotations would always be sequestered until they are ready for publication and that anything found in the wild is official. Linked Open Data, however, thrives on connections and the interesting things that can happen when many iotas of half-completed assertions and descriptions hit critical mass. As a platform, RERUM cannot hide anyone’s work and sincerely commit to open data and interoperable exchange, but it would be foolish to release users’ assertions into the wild knowing that convention assumes these objects have been imbued with a complete academic soul.

The rr:working Motivation is simple. The statement is just that the content of the annotation is not intended for publication—it is shared in good faith for use in discovery or as a seed for further work only. An rr:working object is subject to change and may be completely reversed, disowned, or deleted in the future. It is just a conversation over coffee, albeit one that establishes history and maintains full attribution in its encoding.

The rr:releasing Motivation is not the inverse of rr:working, though they are semantically exclusive. It is a full mark of confidence from the creator and assurance from RERUM that it will be available at this location in perpetuity. When an application marks something as published, no changes may be made to it—full stop.
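A transcription annotation flagged as in-progress might look like this sketch, where the URIs and body structure are illustrative placeholders:

```json
{
  "@id": "https://example.org/v1/id/cc33",
  "@type": "oa:Annotation",
  "motivation": [ "sc:painting", "rr:working" ],
  "oa:hasBody": { "@type": "cnt:ContentAsText", "cnt:chars": "a draft reading" },
  "oa:hasTarget": "https://example.org/iiif/canvas/1#xywh=10,10,400,50"
}
```

Swapping rr:working for rr:releasing would freeze this version permanently at its URI.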

In this way, strange new combinations of publications may arise. A museum may “publish” its Manifest, offering its imprimatur on the canvases, labels, and descriptions within. Then this Manifest may be offered up for crowd-sourced transcription without also implying that the museum stands behind the evolving transcription data. A scholar working through a difficult document can publish a chapter on which she would like to plant her flag while continuing to work on the rest, years before publishing the critical edition of the work. It should be noted that RERUM does not require these motivations to be included in the main body of the object; a tightly controlled schema may instead use properties in the __rerum property, covered in the API, to the same effect.

Postscript: overwrite

This final option for applications is unique to digital objects and should be used very sparingly. By throwing the ?overwrite switch when making an update request, the application acknowledges that the existing version of the object is flawed or so irrelevant as to be useless. In this case only, no new version is made; the original is updated in place and the private __rerum metadata records the date of the overwrite. In most cases where this seems required, best practices should have caught the problem before the change was committed, but RERUM acknowledges that human error will sometimes undermine machine obedience.
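After an overwrite, the private metadata might record the event like this sketch. The isOverwritten property name, dates, and URI are illustrative assumptions, not the documented API:

```json
{
  "@id": "https://example.org/v1/id/dd44",
  "__rerum": {
    "history": { "prime": "root", "previous": "", "next": [] },
    "createdAt": "2018-01-15T09:30:00Z",
    "isOverwritten": "2018-02-20T14:05:00Z"
  }
}
```

Note that the @id does not change; consumers holding the old reference now silently receive the corrected content.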

Editing Remote Objects in RERUM


One use case that has recently captured our imagination in the Center is that posed by the updating of otherwise inaccessible objects. For example, if a user found a transcription annotation at the Wellcome Library which they wanted to update, but there was no accessible annotation service mentioned, that user may find help in a Rerum-connected application. Without specific accommodations, it is likely that we have just encountered another “cut and paste” disjunction when the annotation is created anew in a system without any organized reference to its provenance.

Trust and Authentication

Trust is a complex thing on the Internet. Applications generating annotations (or any other object) are free to assert an appropriate seeAlso or derivedFrom relationship, but it is as easy to mistake as it is to omit. Further, it would only be needed on a newly created RERUM object, meaning there is no automatic way to detect if the relationship should be expected on a new POST.

Within our ecosystem, there is the __rerum property, which is reliably added to objects and is a Source of Truth for object metadata, such as creation date, versioning, and the generating application. This, then, is where the connection to a remote version of an object must be made, if it is to be trusted by any consumer. Though the internal assertions in each object may carry much more detail and meaning, the repository must have a way to simply reference the origin of an “external” version.

Welcoming Guests

Versioning in Rerum relies on the __rerum.history property and the relationships it identifies (@context):

  • prime is assigned to the first instance of any new object as the root of the version tree—every new version refers to the IRI of this first version;
  • previous is the single version from which this version was derived; and
  • next is an array of any versions for which this is previous.
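Concretely, the history of an internal version such as 02 from the trees above might look like this sketch (placeholder URIs):

```json
{
  "prime": "https://example.org/v1/id/01",
  "previous": "https://example.org/v1/id/01",
  "next": [ "https://example.org/v1/id/03", "https://example.org/v1/id/06" ]
}
```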

As all versioning is only maintained through reference, it is critically important that our server accurately process all manipulations to the database to preserve the integrity of the trees. In normal cases, when all versions of an object exist within the same database, there are only three simple configurations (excluding deletions):

  1. Version 1 is the only version generated with the “create” action. It has a value of “root” for history.prime and no value for history.previous.
  2. Internal Nodes always have a value for history.previous and share the same IRI in history.prime. Only an “update” action generates these.
  3. Leaf Nodes are the “latest” versions and resemble internal nodes, but with an empty history.next array.

Though a user may consider it obvious that this “update” is connected to the annotation collections and the manifests adjacent to it, without encoding that relationship as a previous version, no automation or digital citation is possible. In the semantic history of the RERUM version of a remote resource, a combination of cases 1 and 2 occurs—the Rerum version is clearly derived from another resource, but no record of it exists among the objects improved with the __rerum property.

Updating an Idea

It is the JSON-LD standard that presents the solution and directs us toward a useful fourth configuration. The @id property, which is usually assigned during a create action, is likely to be present on an external JSON-LD resource and must be present on any request to the update actions. Our example above is demonstrative of the flexibility of this system. This IRI is not dereferenceable, meaning that it is only useful within an ecosystem of resources, such as an sc:AnnotationList and sc:Manifest, and the entire “previous version” of what is saved in RERUM cannot be known (for considering changes, attribution, etc.) with only the IRI available. Because it is not required to be dereferenceable, it also is not required to exist digitally—a reference to an undigitized resource or an ad hoc IRI for a concept may be used if the case calls for it. This is especially useful for fragments of monolithic resources or difficult-to-parse resources, such as those targeted by the tag algorithm. Even if the updating application is not aware of an authoritative @id for the object it is submitting, one could be generated and attached to invoke this “update” protocol.

In support of this, the Rerum “create” action will reject any object submitted with an @id property, requiring an “update” instead. Then, in the case that the resource being updated does not reside within the database, our new hybrid configuration initiates the version tree. Though it is the prime “root,” this version of the object also references the incoming IRI as the previous version of itself, maintaining the connection (as much as is possible) to the external resource, even if no reference is made within the object itself and no descriptive annotation is attached after the fact. The same solution is applied to references to a “root” version that has been deleted after the version tree was established. Deletion takes the version out of the tree, effectively making it a remote resource for the new prime version.
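Such a hybrid root might be shaped like this sketch, where the external IRI (a placeholder here) stands as previous even though the version is prime:

```json
{
  "@id": "https://example.org/v1/id/ee55",
  "@type": "oa:Annotation",
  "__rerum": {
    "history": {
      "prime": "root",
      "previous": "https://library.example.edu/annos/remote-anno-7",
      "next": []
    }
  }
}
```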


Mirador and Rerum Inbox: Improving the LDN Plugin

Two complementary updates have popped up from OngCDH and our friends. The first extends the functionality of the Mirador LDN Plugin to include more object types, and the second is a new interface on the Rerum Inbox website which makes it easier for anyone to post supplemental content for public IIIF Manifests.

Mirador IIIF LDN Plugin

This wonderful contribution to the Mirador community from Jeffrey C. Witt has been on github since July 2017 and accomplishes the simple task of supplementing IIIF Manifest objects with additional information available from Linked Data Notification (LDN) inboxes listed in the Manifests or hosted as a Rerum Inbox.

It was originally created with Tables of Contents in mind and has continually been updated to support the sorts of sc:Range objects that make up ToCs and other structures within Manifests. In this way, a scholar (or organization) who has created content that supplements a Manifest to which she has no access (for example, one hosted at a national library) can enhance the available information without imposing on or mirroring the original object in its hosting repository.

The Mirador plugin in use at

The plugin creates a simple UI (above) for exposing to the user when an Announcement is available for the resource loaded into the window and prompting to include it as part of the working object. Having proven itself with Range objects, the next step was to include annotations.

Annotations in IIIF follow the Web Annotation and IIIF Presentation standards. Within IIIF (and Mirador), each sc:Canvas may have an otherContent property that contains an array of sc:AnnotationList objects. Related lists are aggregated into sc:Layer objects which span entire Manifests. These Layers are used for commentary or transcription and are often not part of the Manifests hosted by large institutions, as it is more common that they are the result of research than cataloguing or digitization activities.
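In Presentation 2.x terms, the linkage looks roughly like this fragment, where all URIs are placeholders:

```json
{
  "@id": "https://example.org/iiif/canvas/1",
  "@type": "sc:Canvas",
  "otherContent": [
    {
      "@id": "https://example.org/iiif/list/transcription-p1",
      "@type": "sc:AnnotationList",
      "within": { "@id": "https://example.org/iiif/layer/transcription", "@type": "sc:Layer" }
    }
  ]
}
```

The within reference is what lets a viewer aggregate related lists into a single Layer spanning the Manifest.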

As of 20 February 2018, the plugin detects and renders announcements of the types sc:Layer and sc:AnnotationList in addition to the sc:Range type previously supported. IIIF Presentation 3.0 objects, such as `as:OrderedCollection` and `as:OrderedCollectionPage`, are allowed by the Inbox, but have not yet been integrated into the plugin for rendering in Mirador. Moving forward, additional cases for supplementary metadata, canvas images, and publication and review will be considered. Visit the plugin on github to contribute stories or code.

Rerum Inbox

Linked Data Notifications are useless without somewhere to announce them, so the Walter J. Ong, S.J. Center for Digital Humanities created the Rerum Inbox and offers it free to the world as a place to post and retrieve notifications about any resource, even if the resource doesn’t declare the service itself. Since August 2017, Rerum Inbox has been accepting notifications through direct HTTP requests. The website provided the specification and explained the purpose, but otherwise had not been terribly useful.

Inbox Lookup

Now there are two simple interfaces provided for anyone who wishes to explore Linked Data Notifications in the Rerum Inbox without reaching for the command line. The first discovers whether the provided resource is targeted by any notifications. Of course IIIF Drag ’n Drop is supported, and in truth it is just a UI on the endpoint. We are a long way from critical mass, so it is unlikely a randomly selected resource will have results, but it is a simple way to be reassured that your notification made it in okay.

Create Announcements

If you have just a couple resources to supplement, there is a quick tool on the site to leave a notification without much fuss. Provide the target (a dereferenceable URI will be verified), the actor (if the URI of the person or organization does not reference a label, you may provide one), and the object itself (with a dereferenceable URI, other preferred fields will be filled out automatically) and click “submit” to post a new announcement. For simplicity, this interface creates only default notifications of the type Announce and the motivation supplementing, so if you need something else, you’ll still have to do it yourself.
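A default notification of the shape this form produces might look like the following sketch. The URIs are placeholders, and the exact vocabulary the Inbox stores may differ from this Activity Streams rendering:

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Announce",
  "actor": "https://example.org/agents/researcher-1",
  "object": "https://example.org/iiif/list/transcription-p1",
  "target": "https://library.example.edu/iiif/manifest.json",
  "motivation": "supplementing"
}
```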

With these two updates, we hope to increase the use of both the Rerum Inbox and the contributions to these projects by the communities they serve. If you have suggestions about how to improve either of these offerings, please leave a comment below or visit their respective github repositories to leave an issue or pull request.

Identity through Evidence

The Birth of an Identity



Let’s take a real case and find out what it indicates and how it ought to be annotated. This image is the first page of a six page (3 folio) letter from Private Allen Gooch to his family, but let’s not get ahead of ourselves.

Find an artifact

According to, Dayna Gooch Jacobs has a letter from World War I. There are already strong cataloguing and metadata description schemata to make this real thing a well-described lump of matter. At this point, the way in which it is discovered is decidedly analog:

“Hello, Mrs. Jacobs? Would you mind if I stopped by to look at your WWI letter?”

This is inoffensive, but undiscoverable. As it may be described in print catalogs or included on library databases, some anonymous individual on behalf of an institution may apply metadata that aids in the discovery:

“M Librarian, could you help me locate WWI letters from Military Police?”

Such a request is possible in many ways, but is already reliant on a vast knowledge of the vocabularies of subject heading and the decisions of the search application. However, the goal is possibility, not perfection, so let’s let this be good enough and admit that we have reliably entered this letter into the public record.

Establish Uniqueness


In the human world, there is a lot of fuzziness in definition. This is fine because we are comfortable replacing definition with a preponderance of description. Even with a name “Dayna Gooch Jacobs” and a photo (right), to really find the right person, a researcher may want an address, measurements, occupation, even DNA. The digital world solves this with URIs. If I say, “1” and commit to never referring to that “1” again for any other object, I can reliably reference it over and over. In fact, even if the entire resource vaporizes, the corpus of annotations that point to it can still describe it. In this way, an authority file isn’t even necessary, except to assure the uniqueness of the identifier. In JSON-LD, we’re looking at something like this:

{ "@id" : "1" }

This is much better if we upgrade the identifier to something resolvable (or at least unique) like:

{ "@id" : "" } or
{ "@id" : "" }

In the same way, the letter can be identified. There is already a great standard for describing the digital object in IIIF, so if we allow for a URI that stands in for the real object, we can link it to the sc:Manifest that represents the digitization of it:

{ "@id" : "",
  "@type" : "sc:Manifest",
  "label" : "WWI Letter from Private Gooch",
  "pto:Facsimile" : "",
  "sequences" : [ "" ]
}

So far, so good.

Discovery and Creation

Until I saw this letter on, I didn’t know Dayna Jacobs existed. If a trusted friend or colleague told me she existed, I would conventionally assume personal experience as his evidence unless otherwise clarified. We have filled our human experience with descriptions (beloved by our brains) in place of identifiers and definitions (required by our robots). It is common to make identity-centric decisions such as “Let’s mint URIs for the wives of Henry VIII” or “Let’s create identifiers for every character in The Shoemaker’s Holiday.”

The conventional approach to this opportunity to describe is idea-first cataloguing. The Thing is described within a schema through data entry and largely anonymous or impersonal assertion. The underlying assumption is that the item is best described in isolation, then allowing the schemata in which it participates to impose discoverability. Robots appreciate this clean and defined style of categorization; it is similar to the discrete properties used on digital objects. The problem is that the human cataloguer is only passing as robot when she attaches metadata to the item record. A robot would insist a play meet all the defined qualities of the vocabulary term (fabio:play) that alludes to an English word (“play”) before asserting its membership. The human, most likely, is reaching the same conclusion based on a similarity between the tested item and other items already in that same category. The appointment is made based on where others would look for it as much as on any intrinsic properties. In this way, we are accommodating imagined future human researchers and deceiving the computers which rely on the sincerity of our cataloguing.

Then let us reverse the process, requiring some evidence of existence first, and then limiting that identity to an anchor. All the descriptive metadata is promoted to annotations, which can carry with them the attribution of the annotator and the evidence on which it is based. This means that an authoritative catalogue can still identify items which belong within it, but the probabilistic nature of human assertion is honored with multiple and possibly conflicting assertions about various item properties. Two unrelated researchers can annotate a play—one for structure and one for historical locations—without having to dodge each other in a TEI/XML document or create parallel and redundant resources.

Once annotations are easier, scholars can be more fearless in the proliferation of identifiers. New anchors for digital pointers to real things only need some evidence through reference in a known object. At once, the linkages between objects are in place and the next annotation has a reliable place to point to along with some context for the thing. By allowing conflicting descriptions of things, the reliable anchors are flexible and reusable, even as the academic conversation insists on redescribing once obvious facts. Discovery begets creation begets description begets discovery.

Annotation in Practice

Let us allow that the letter itself is already catalogued and described and that a URI has been minted for it. If I would like to know the author, I could read the description. Some library record, metadata set, or dedicated “description” field in the sc:Manifest object is likely to offer the text from the exhibit’s HTML document:

“Letter from Private Allen L. Gooch to his family in Arizona during World War I. Transcribed by Dayna Gooch Jacobs and in her possession. Slashes indicate line breaks on original letter.”

This is helpful to humans, but inaccessible to robots. Moreover, if I wish to criticize any part of this assertion, I have to replace the entire entry or use a selector to carefully tease out the problematic statement. Most importantly, this description is devoid of evidence and attribution, so authority can only be implied from the hosting repository. If this record had already been treated by a cataloguer, it is possible she would have anonymously populated an author field based on the description text alone.

A better description is possible. I would like to add the author as the dc:creator of this item and the derivative digital objects. The prevailing convention adds the property as metadata to the objects as a simple key-value pair. This is the simplest way to render these items in a viewer, but the structure cannot allow for any advantages over the description except for the granularity. The best annotation asserts a creator with evidence and attribution. There are three places to find this evidence:

  1. Transcription—in Dayna Jacobs’s article, the end of the posted transcription reads: “Private Allen L. Gooch Troop A, 314th Military Police, 89th Division. American Expeditionnary Forces.”
  2. Image Annotation—The image resource is a great example of the type of specific evidence useful in these kinds of assertions. If someone wants to criticize the claim, the annotated image makes clear that I determined the creator based on the signature on the letter, which may be forged or otherwise untrustworthy.
  3. Scholarly Assertion—Without any other evidence, a scholar should be able to stake an assertion on his own reputation. To do so may weaken the suggestion, but it does not eliminate it. In fact, as in mathematics or Wikipedia, it may serve as a beacon to others to find better evidence for the claim.
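An assertion of the second kind might be encoded like this sketch. The URIs, fragment selector, and agent are illustrative assumptions:

```json
{
  "@id": "https://example.org/v1/id/ff66",
  "@type": "oa:Annotation",
  "motivation": "oa:describing",
  "oa:hasBody": { "dc:creator": "Private Allen L. Gooch" },
  "oa:hasTarget": "https://example.org/iiif/canvas/6#xywh=120,900,300,80",
  "oa:annotatedBy": { "@id": "https://example.org/agents/researcher-1", "@type": "foaf:Agent" }
}
```

The target fragment points at the signature itself, so the evidence and the attribution travel with the claim.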

What needs to be acknowledged here is that no description is absolute. What had been descriptive metadata is now asserted description by some foaf:Agent because of evidence. If I were to resolve the above sc:Manifest at the Smithsonian, I would probably trust it. I would feel better knowing that the Smithsonian is actually asserting the particular piece of metadata I am interested in (date, author, etc.) formally, since it is possible this object was just ingested from some other collection and has not been reviewed. It would be even better to know that Carol, a trusted employee for over 25 years, was the metadata librarian who dated this and several other letters of the same period.

Annotating Discovered Others

This letter the genealogist described, written by a young private, is amazing. Even when only considering its direct content, it has witnessed or is evidence for many discrete nouns.

People, Groups, and Corporations
  • Private Allen L. Gooch
  • The Mother of Private Allen L. Gooch
  • The family and acquaintances of Private Allen L. Gooch
  • Friend and Sergeant to Private Allen L. Gooch
  • Troop A, 314th Military Police, 89th Division
  • U.S. Army
  • “Doll” to Private Allen L. Gooch
Locations and Landmarks
  • Hudson Bay
  • About 25 miles up the Hudson Bay
  • Camp Mills on the Bay
  • Brooklyn Bridge
  • NYC
  • Hachita, NM
  • Duncan, AZ
Materials and Quantifiables
  • Shoes (not enough)
  • Sixshooters (not enough)
  • Flying Machines (about 20)
  • Tent (per)
  • Ships (3 large)
Time and Date
  • Friday, 7 June 1918
  • A week or a month in place
  • When it “rained and rained” just before 7 June
  • Shining sun the next morning with the aircraft
  • Saturday 8 June, when Private Gooch took his 24 hour pass to NYC (prediction)

The assembly of this circumstance requires at least several digital objects that are just as important for the encoding of this letter:

  • sc:Manifest for the ordered images, along with the required oa:Annotation, sc:AnnotationList, and sc:Canvas objects within.
  • The Transcription provided on the page is a text resource that could be better encoded and contains annotations for line breaks within the text.
  • The foaf:Agent object(s) responsible for drafting the blog entry, transcribing, digitizing the letter, and taking responsibility for holding the physical item in collection should have at least one URI to identify them, even if other information is not available.

This is a large, incomplete, and unprioritized set of things that should probably be encoded. Let’s take just one poorly defined thing here and start to describe it.

“This friend of mine/ from hachita is/ a sergeant in my/ troop he is getting/ me a twenty four/ hour pass and I/ am goeing to N.Y./ City tomorrow”

This passage insists that in addition to the author, another fellow exists. From the brief passage, we know:

  1. Male Human Person: “he” pronoun is used and it would be strange if the male author was commanded by a female or non-human at this time.
  2. Sergeant: possibly a specific rank, but it could be one of several. Someone knowledgeable in historic ranks may know the most appropriate one for this reference.
  3. Friend: by the author’s own assertion, so it isn’t something that can be very narrowly defined.
  4. Hachita: there is a strong possibility this is Hachita, NM and is suggested as at least a hometown, if not a birthplace for this man. Further research could add evidence to this descriptor.
  5. Location: we can be confident that as sure as we trust the author, this man is with the troop at Camp Mills on and around Friday, 7 June [1918].
  6. Troop A 314th: through his connection to Private Gooch and “in my/ troop”.

Remarkably, without any name or label to speak of in this letter, we can know quite a bit about this man. In most cataloguing cases, only information deemed relevant would have been included (often abandoning evidence and attribution) and imprecise information (such as “hachita” or “sergeant”) would be unfortunately removed or inaccurately promoted to categorical metadata. Had we begun with just a name on a list, we may never have found this reference. Approaching it thus, we have found an interesting negative space and may go in search of the 1918 MP sergeant from New Mexico in other documents and connect the reference more completely. A focused project or schema may even induce negative space for certain types of things to point out the unknowns. In this case, a Person may require an annotation for label, date of birth, date of death, or nationality, even if the content of such an annotation were null or unknown.

The relationships are also part of the description, and several of the nouns from the list above serve as powerful connecting nodes. Humans are always good connectors because we care so much about their connections. Other aggregating entities, such as Troop A, Camp Mills, or NYC, work well because otherwise unrelated data can find kin. Dates and times are also useful because they offer marginal characters a role in dominant history. Rigid structures such as XML hierarchies insist on a preference when encoding these relationships; with annotations, they can all be explored ad hoc without disrupting the original resources.

The Importance of Evidence

When we build applications to consume digital objects or, in a conversation, are asked to describe something, we certainly list properties, often include relationships, and rarely mention evidence. Look again at Private Gooch’s passage about his sergeant:

“This friend of mine [as his past behaviors align with my definition of friendliness and within the context of this letter]/ from hachita [as he has indicated to me] is/ a sergeant [as stated by the U.S. Army protocols and promotions] in my/ troop [as you can see from his appointment] he is getting/ me a twenty four/ hour pass [personal interview] and I/ am goeing to N.Y./ City tomorrow [probable future based on personal intentions and access to aforementioned pass].”

It is clear that spending time with assumed evidence is inefficient and so convention often steps right past it. However, in scholarly criticism, it is often misalignment of interpretation and available evidence that separates minds. The better documented the obvious facts, the more time is available for deep exploration of real gaps in knowledge or important divergences in interpretation.

The simple object for the sergeant is easily constructed (with @context omitted):

{ "@id" : "",
  "@type" : "foaf:Person",
  "label" : "Troop A 314th Sergeant (unnamed)",
  "gender" : "male",
  "hometown" : "Hachita",
  "rank" : "sergeant" }

Good encoders would agree here that finding a reliable URI for the values in gender, hometown, or rank would be better, but there is nothing unacceptable about these strings as they are. Let’s take just one of these properties and start to encode the conventions.

"hometown" : {
    "@value":"Hachita (N.M.)",
    "@id" : "" }

Even with only the Library of Congress link, it is clear that more information is available on this town, which could be helpful in determining if it is the one actually indicated. However, as it is, another researcher would have to reread the original transcription or view the images of the letter to know why Hachita was indicated as the hometown. Since the sergeant URI is discrete from the letter URI, there is no reason that process would be simple or natural. MADS offers a madsrdf:Source definition that allows for citation, so we can start there:

"Source": {
    "citationSource" : "",
    "citationNote" : "my friend from hachita",
    "comment" : "The letter's author is from Arizona and it is very likely that the neighboring mining town of Hachita, NM is the one referred to here." }

This allows us to add a Source for the hometown assertion without breaking any rules of reference. If there were multiple assertions for hometown, the value could easily become an array. Similarly, if there are multiple Sources to reference, it may be a @set instead. The comment property is trouble, though, since it is semantically sensible, but a bit orphaned. The solution may be found either in the value of the citationSource or in the Source itself, just as description fields in other contexts are often very helpful to humans, but a challenge to parse.
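If a second record agreed on the hometown, the Source value might become an array, as in the following sketch (the second Source entry is hypothetical, included only to show the shape):

```json
"hometown" : {
    "@value" : "Hachita (N.M.)",
    "@id" : "",
    "Source" : [
        { "citationSource" : "",
          "citationNote" : "my friend from hachita" },
        { "citationSource" : "",
          "citationNote" : "hypothetical second record agreeing on Hachita, NM" } ] }
```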

As evidence, the citationSource is an oa:Annotation, so plenty of description and attribution can be attached to it. However, the madsrdf:Source itself is intended as “A resource that represents the source of information about another resource,” so it may be simpler to align it with typical oa:Annotation structures:

"evidence" : {
    "@type" : [ "oa:Annotation", "madsrdf:Source" ], 
    "citationSource, oa:hasTarget" : "",
    "citationNote, oa:hasBody" : "The letter's author is from Arizona and it is very likely that the neighboring mining town of Hachita, NM is the one referred to here.",
    "annotatedBy" : "",
    "serializedBy" : "",
    "motivatedBy" : "oa:describing" }

Obviously, in this case, the keys are demanding simplification in a context file, but they have been expanded here to show their flexibility. In this example, the user who asserted the value for hometown onto the sergeant’s digital anchor is credited, as is the software that serialized it. The evidence_03 object is not expanded here, but it can be any resource, such as a list that selects both the image segment from the sc:Canvas of the specific page and the text of the transcription that details it. Similarly, the citationNote is a short string literal, readable only by humans, but it could just as well refer to some defined ontology for assertions or comparisons. The power here is that the description is now defended against weak criticism by documenting its derivation and intention as fully as possible.
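A context file for such documents might collapse the doubled keys back to short terms; a minimal sketch, assuming each term is mapped to a single Open Annotation property (a real context would settle which vocabulary each term belongs to):

```json
"@context" : {
    "target" : { "@id" : "oa:hasTarget", "@type" : "@id" },
    "body" : "oa:hasBody",
    "annotatedBy" : { "@id" : "oa:annotatedBy", "@type" : "@id" },
    "serializedBy" : { "@id" : "oa:serializedBy", "@type" : "@id" },
    "motivatedBy" : { "@id" : "oa:motivatedBy", "@type" : "@id" } }
```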

Suddenly, tag:GhostTownWeb:gulino comes along and introduces a challenge, explaining that Hachita has an adjacent “Old Hachita” that may be the one referred to in this case. His evidence is also valid, but instead of just lobbing criticism at the assertion, he must place his dissent into the conversation. If he intends to make a parallel claim, he can offer an alternative value for hometown and provide his evidence, letting future scholars decide between them when it is relevant. If he wishes to undermine the assertion user_01 has made, the discrete nature of annotation means the controversy can be logged directly onto that assertion, without requiring a change to the resulting value of hometown. As these annotations chain together, the @graph becomes a record of the academic conversation.
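A parallel claim might leave hometown holding both asserted values, each carrying its own evidence; a sketch (the structure is illustrative):

```json
"hometown" : [
    { "@value" : "Hachita (N.M.)",
      "Source" : { "citationSource" : "",
                   "citationNote" : "my friend from hachita" } },
    { "@value" : "Old Hachita (N.M.)",
      "Source" : { "citationSource" : "tag:GhostTownWeb:gulino",
                   "citationNote" : "the adjacent ghost town may be the Hachita referred to here" } } ]
```

Neither value displaces the other; an interface can present both assertions with their evidence and let the reader weigh them.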

Open Attribution

The idea of citation is not new, but annotation adds an interesting and liberating dimension. Linked Data thrives in a graph, where relationships are equal partners with the nodes they connect. The vast expanse of a Linked Data graph, however, overshadows the significant problem of omitted records and systemic bias. Because annotation can support attribution for claims and creation separately, it is possible to pull nondigital or unwilling records into the graph. In the example above, the assertion that “hachita” meant that Hachita, NM is the sergeant’s hometown was novel and correctly attributed to user_01. However, this was learned from a transcription on a blog that should be credited to Dayna. As user_01, I can easily create a new transcription object with the content lifted right off the original blog post, but encoded with the page and line annotations that Dayna intended. The resulting annotation is created by me, but attributed to her. If she had no URI before, she will now; if her work was not encoded or discoverable before, it is now. Magic.
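One plausible way to separate the two attributions is sketched below (not the only encoding): dcterms:creator on the body credits the transcriber, while oa:annotatedBy credits the encoder who created the annotation.

```json
{ "@type" : "oa:Annotation",
  "oa:hasTarget" : "",
  "oa:hasBody" : {
      "@type" : "cnt:ContentAsText",
      "cnt:chars" : "This friend of mine from hachita is a sergeant in my troop...",
      "dcterms:creator" : "" },
  "oa:annotatedBy" : "" }
```

Here oa:annotatedBy would hold user_01’s URI, while the body’s dcterms:creator would hold the transcriber’s URI, newly minted if necessary.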

Remote attribution becomes particularly powerful when used just as citation. Instead of a simple bibliography, a URI can be discovered or minted for some unphotographed reference book. As the selectors and pointers become more specific, the resolution of information about this completely off-line resource begins to be encoded. The cross-references between archived articles can be encoded and reacted to in a new article and the relationships begin to gain definition retroactively through history. It is a graph without dead-ends.

Towards a Clear Model for Annotation

By decoupling URI identities from the descriptions of them that are often conflated as metadata, scholarship may empower the digital anchors for real things to become the supernodes of academic discourse instead of the virtual trading cards of research. Proper annotation and attribution create verbose and well-documented descriptions of objects, real and digital, that persist when ignored and adapt when challenged. Though technically complex, this graph can and should be simplified through various interfaces for consumers who are interested in only certain aspects of the entire corpus of annotations and description.

The biggest change to data that I suggest is that much of the metadata currently attached to records ought to be upgraded to assertions through annotation. A URI, whether resolvable or not, is required for each thing. A label of some sort may be helpful, but isn’t sacrosanct. For resources such as digital files, there may be some real metadata to attach, such as modification dates, file format, etc., but nothing that is not encoded in the file itself. Reflexive seeAlso and describedBy links to contributing resources may be welcome, but since the URI may not be an object at all, they are obviously not required.
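Under this model, the record attached to a URI could be nearly empty; a minimal sketch (the label value is illustrative):

```json
{ "@id" : "",
  "label" : "letter from Camp Mills, 7 June 1918",
  "describedBy" : "" }
```

Everything else said about the letter lives in annotations that target this URI, so the record itself never needs revision as the scholarship around it grows.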

Given our working example here, the letter, there are a host of URIs which may be held together or distributed all around the academic world. The encoding of the letter is important, but is just one possible expression of the real thing, so having an authoritative URI—and making sure that it is referenced in the derivative digital objects—avoids confusion or fragmentation of the descriptions. Even what feels like real, physical metadata is often found in error when protocols for taking measurements change, original provenance or dating from historic indices or catalogues is challenged, or descriptors like location or subject change because of political and social revolution.

The URI is the node; the annotations describe the node through assertion. The best assertions identify the agent who makes them and connect to other nodes through their evidence. A well encoded network of annotations creates an environment within which data can be visualized and manipulated in interfaces designed for the specific exploration required. Scholars currently delve into an issue by finding research in a relevant subject area whose context and approach are congruous with their own. By recording the academic conversation around these nodes of research, individual scholars are empowered to review scholarship in their own context, even if explicit previous scholarship does not exist, and to explore research in tangential and adjacent fields by making it more accessible and universally described. Instead of encoding for computers what we say about things, we should capture the conventions that pervade our conceptualizations—encode what we are thinking.

The nodes that are the most and least described can be tested by robots for certain qualities—chronological clustering, contradictions and controversy, unexpectedly dominant fields of research, connections to previous research subjects—that may make them interesting to scholars, assisting in discovery. Though nodes are technically equivalent to each other, types representing entities such as events, landmarks, people, corporations, compositions, and abstractions will undoubtedly become as heavy in the graph as they are in the human brain. The challenge to digital humanists is to capture the probabilistic nature of assertion without limiting the ability of any scholar to join the conversation or enter a thought into the public discourse. The natural end of every described resource is a literal resource—text, audio, images, visualizations—where humanity is often on display and robots are, for now, befuddled.

November 2018