Ong Blog

Exploiting Client-side Line Detection

This continues a previous post that introduces the minor piece of code we’ll be exploring below.

Hello, Old Friend

Recently, two events coincided that inspired me to pull this code back out and take a second look at the process. The first was that our center charged a group of Computer Science majors with improving the effectiveness of our image processing algorithm as part of their senior capstone project; the second was the seemingly sudden burst of HTR promises, which depend on some level of layout analysis to work. In both cases, I was struck that the improvements came from more processing power and additional layers of analysis. Although more of the outlier cases were falling into scope and well-formed documents were becoming more automatable, the simple cases were moving from realtime (<8 seconds) into delays of minutes or even hours before interaction with the results became possible. I do not want to diminish the scale of these accomplishments or sound like someone who gripes about waiting almost two hours to board an airplane that will take me to the other side of the world in half a day. However, there are certainly use cases at the lower end of the complexity spectrum that may not require and cannot benefit from the horsepower being built into these new models.

I honestly don’t know where this sample image came from (aside from the British Library), but it was in my cache when I lost WiFi years ago. It was time to feed this to the machine and see what happened. In short order, I wrote up a function to visualize the sums of the rows and columns to see if the text box seemed to be obvious. The result felt conclusive:

Setting a default threshold of 10% of the busiest row (marked in black beneath the image), the possible columns popped out as expected. I was also pleased to see that candidate rows appear without too much imagination. Obviously, there are some spots, such as the gutter and the page edges that do not represent a text area, but by simply constraining the width of the analysis and expecting the sawtooth of rows, I not only eliminated irrelevant “columns” but was able to detect separation within a column. I can easily imagine bracket glosses, round text paths, or heavily decorated text that would break this, but those are not my target. With no optimization and the inclusion of several heavy visualizations, I was able to render decent candidate annotations for column and line detection in about two seconds. At the lowest resolution, this time was under one-fifth of a second.
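A minimal sketch of that projection, assuming the image has already been run through the busyness filter from the previous post so that every pixel's red channel holds its busyness value; this is an illustration, not the code from the repository:

function profile(imageData, threshold = 0.10) {
  const { width, height, data } = imageData;
  const rowSums = new Array(height).fill(0);
  const colSums = new Array(width).fill(0);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const v = data[(y * width + x) * 4]; // red channel of the busyness image
      rowSums[y] += v;
      colSums[x] += v;
    }
  }
  // default threshold: 10% of the busiest row (and, for columns, of the busiest column)
  const rowCutoff = rowSums.reduce((a, b) => Math.max(a, b), 0) * threshold;
  const colCutoff = colSums.reduce((a, b) => Math.max(a, b), 0) * threshold;
  return {
    candidateRows: rowSums.map((sum, y) => (sum > rowCutoff ? y : -1)).filter(y => y > -1),
    candidateCols: colSums.map((sum, x) => (sum > colCutoff ? x : -1)).filter(x => x > -1)
  };
}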

Things take a turn

Rather than declare victory, I investigated the minor errors that I was able to find. While I expected and accepted pulled-out capitals and paragraph numbers, as well as the mischaracterization of a header graphic as text, it bothered me that one pair of lines was joined even though the visualization suggested their separation. I could fiddle with the thresholds to get a better result, but that also thinned the other lines beyond what made sense to me, so it was not a solution. Stepping through the numbers, it seemed that the slight rotation magnified the impact of the ascenders, descenders, and diacritics that busied up the interlinear spaces. It would not be unreasonable for this lightweight tool to require pre-processed images with good alignment, but some simple poking told me this image was “off” by only about -0.75 degrees, which feels close enough for most humans to consider it a good photo. Instead I began to imagine the shadow cast by a rotating text box and experimented with rotations that made the column curve more round or severe.

They were mathematically different, but determining the best fit was becoming more and more complex, which undermined the original purpose. A simple check for the rotation that produced the narrowest column was possible, and it seemed to always coincide with the best rotation, but automating that step was difficult on multiple columns and it was too easy to miss the best rotation if the interval was set too high. I looked at column widths, row counts, and the difference between max and min values for a host of metrics, but nothing reliably predicted the correct rotation.

Always Assume

After carefully recording and comparing characteristics of good and bad fits across several images, I discovered an assumption about manuscripts that I was not yet leveraging—rows are regular. Even with variety, most ruled manuscripts will be dominated by rows of similar heights. I updated the function to select the best rotation based on the minimum standard deviation from the mean row height. This calculation is lightweight for the browser, and the rows are already calculated at each step of determining column boundaries, so there was minimal overhead. As a default, I evaluate each degree from -3 to 3 and then rerun around the lowest value with half the interval until the interval is under one-eighth of a degree. Without caching checks or eliminating intermediate renders, this process takes longer, but it regularly finds the best rotation for a variety of documents. On my machine, it takes about 1 millisecond per pixel processed (40 seconds with the sample image), but the back of my envelope records 922 of these tests as redundant, which means a simple caching optimization will put this process under twenty seconds. Using this method, an incredibly problematic folio (microfilm, distorted page, skewed photography, tight lines) is not only rotated well, but is evaluated with incredible precision.

Robert Grosseteste, Eton College Library 8

Full page, rotated 1.375 degrees, 52 rows in 2 columns
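A rough sketch of that coarse-to-fine search, assuming rotate() and detectRows() helpers from the rest of the tool (both names are hypothetical here):

function bestRotation(imageData, center = 0, interval = 1, span = 3) {
  let best = { angle: center, score: Infinity };
  for (let angle = center - span * interval; angle <= center + span * interval; angle += interval) {
    const heights = detectRows(rotate(imageData, angle)).map(row => row.height);
    const mean = heights.reduce((sum, h) => sum + h, 0) / heights.length;
    // score each candidate rotation by how regular its rows are
    const stdDev = Math.sqrt(heights.reduce((sum, h) => sum + (h - mean) ** 2, 0) / heights.length);
    if (stdDev < best.score) best = { angle, score: stdDev };
  }
  // rerun around the winner with half the interval until it reaches an eighth of a degree
  if (interval <= 0.125) return best.angle;
  return bestRotation(imageData, best.angle, interval / 2);
}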

Next Steps

This is not remarkable because it is possible, but because it is mathematically simple and reasonable to accomplish on a client. This not only means the transcription (or generic annotation) application does not need to sideload the image to process it, but also that any image can be offered, even one off the local machine or that doesn’t use helpful standards like IIIF. One can imagine this analysis may be available for any image within the browser through a bookmarklet or extension. Once analyzed, these annotations could be sent to a service like Rerum, saved into LocalStorage for later recall, or sent directly into a transcription tool like TPEN.
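For instance, stashing the results in LocalStorage needs nothing more than a key and a serializer; the key name and annotation shape in this sketch are illustrative only:

function stashDetection(imageUrl, columns) {
  const annotations = columns.map(col => ({
    "@type": "oa:Annotation",
    "motivation": "sc:painting",
    "on": imageUrl + "#xywh=" + [col.x, col.y, col.width, col.height].join(",")
  }));
  // recall later with JSON.parse(localStorage.getItem("lineDetection:" + imageUrl))
  localStorage.setItem("lineDetection:" + imageUrl, JSON.stringify(annotations));
  return annotations;
}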

Within an application, this tool can be used even more powerfully. Without requiring a complex API to save settings, a user may tweak the parameters to serve their specific document and reuse those settings on each page as the interface renders it. Even if the line detection is inaccurate or unused, the column identification may be helpful to close-crop an image for translation, close study, or to set a default viewbox for an undescribed page.

This is not part of any active project and just represents a couple of days spent flogging an old idea. The whole thing, such as it is, has a GitHub repository, but isn’t going to see action until there is a relevant use case. What do you think? Is this worth a few more cycles? How would you use a tool like this, if you could do lightweight image analysis just in your browser or offline?

Experimenting with Client-side Line Detection

Does not compute

Using an “old” iPad on a plane to review transcription data was a clarifying task. For all the advances in research technologies, even simple tasks, such as viewing manuscript images on an institution’s website, can crash a five-year-old browser, effectively rendering those resources inaccessible. I am not willing to accept that the very tools and scripts we have been building to make these resources more interactive and discoverable are also rendering them inaccessible on aging (but still functioning) hardware. There is a place for discussing progressive enhancement design, progressive web applications, and emerging mesh-style protocols like IPFS, but I’m going to be very targeted in this post. The choke point of manuscript image analysis has always been the server-side task of layout analysis (as in our TPEN application), which has been making great advances with the addition of machine learning in computing clusters (Transkribus and others are in the spotlight at the moment). I am calling for an algorithm simple enough to run in the browser of an underpowered machine that can accomplish some simple tasks on “decent” photography.

WiFi not available

Imagine you are in a magic tincan that zips through the air at high speeds and connects you simultaneously to all the world’s knowledge. From these heights you work away, paging through images of a medieval manuscript and transcribing it into a digital language that encodes it for limitless reuse. You are working at t-pen.org not because it is the best at image analysis, but because its servers run an algorithm good enough for your clean document and do so in real time, returning line detection on each page in mere seconds—at least it used to. As the Internet connection gets spottier, the responses become slower. You wait eight seconds… thirty seconds… and then silence. Your mind reels trying to recall that YouTube video you watched on EM waves and to resist blaming this outage on a vengeful god. A full minute without WiFi passes and you realize there is a chemical bomb on your lap that cannot even entertain you. It would have been more reliable to carry a pencil and a sheet from a mimeograph with you than this unusual pile of heavy metals, polymers, and pressure-cooked sand. What else about your life have you failed to question? Do you even really grasp the difference between air speed and ground speed? How planes!?

Dash blocks and breathe

I was unable to answer all these questions for myself, but I did start to wonder about what minimum effective image analysis might look like. Existing algorithms with which I was familiar used very generic assumptions when looking for lines. The truth is that manuscripts can be quite diverse in form, but photographs of them taken for transcription strongly tend towards some similarities. For this experiment, I am dealing with manuscripts where the text is laid out in rectangular blocks and takes up at least a quarter of the image. I wanted to find something that could deal with the dark mattes, color bars, rulers, and other calibration paraphernalia. Ideally, it would be able to find text boxes and the lines within, even if the original image was slightly askew or distorted. Algorithms that looked only for dark areas were confused by mattes and often rated a red block as equivalent to a column of text. Strictly thresholding algorithms lost faded tan scripts on parchment easily. My solution would need to be good enough to run in a vanilla state and quick enough to calibrate for special cases if needed.

I did not look for dark spots, but for “busyness” in the page. While some scripts may have regions of strong linear consistency, most scripts (even character-based ones) are distinguished by their contrast with the plainness of the support medium.

Sample image processed for “busyness”

I began, on that airplane ride, to write a simple fork of some canvas element JavaScript filters I had bookmarked a long time ago. Simply, I redrew the image in the browser as a representation of its busyness. What I dropped on Plunker when I landed took each pixel and rewrote it depending on the difference between itself and the adjacent pixels on the row. I was excited that with three very different samples, the resulting visualization clearly identified the text block and reduced the debris. By then the plane had landed and I put away my childish fears that technology would ever abandon me.
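That filter amounts to only a few lines. A simplified reconstruction (not the Plunker code itself) might look like this, assuming a canvas already holding the image:

function busyness(canvas) {
  const ctx = canvas.getContext("2d");
  const img = ctx.getImageData(0, 0, canvas.width, canvas.height);
  const out = ctx.createImageData(img.width, img.height);
  for (let i = 0; i < img.data.length; i += 4) {
    // approximate brightness of this pixel and its neighbor on the row (row ends ignored for brevity)
    const here = (img.data[i] + img.data[i + 1] + img.data[i + 2]) / 3;
    const next = (img.data[i + 4] + img.data[i + 5] + img.data[i + 6]) / 3 || here;
    const diff = Math.abs(here - next); // flat parchment goes dark, strokes stay bright
    out.data[i] = out.data[i + 1] = out.data[i + 2] = diff;
    out.data[i + 3] = 255; // keep the pixel opaque
  }
  ctx.putImageData(out, 0, 0); // redraw the image as a representation of its own busyness
}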

Finding Value

In the next post, I will discuss why I opened up this old pile of code again to see if I could teach it a few new tricks. I am curious, though: what snippets or small concepts do you have in a dusty digital drawer that might be useful? Use the comments here to advertise the GitHub repo you haven’t contributed to in years, but still haven’t deleted.

Rerum Enters Public Alpha

Come one, come all.


What is this Rerum of which we Tweet?

Rerum is an open and free repository for all sorts of digital things. Digital anchors for real world objects and encoded assertions from real world people are stored without prejudice. It’s a data ecosystem to bring those unique digital solutions to scholarly demands closer to reality.

Who Needs another Annotation Store?

Rerum exists to lower the barriers to entry into digital tool building for the Humanities for as many people as possible. Not everyone has easy access to the skill sets required for even the most straightforward proof-of-concept applications and interfaces needed to go for funding. We asked ourselves what we could do to help. We began with the thing we most continually redesigned for each new project: a backbone to store data and an API for interactions. We are building that backbone on web standards because we too often found ourselves bound up in a project trying to munge old data into a new system or wasting effort converting from one proprietary format into another to allow the use of various plugins, visualizations, or analytics. For Rerum, we implement standards so that other APIs and applications do not have to struggle to find a way in and so developers are immediately comfortable.

We designed and built Rerum as a free, open digital object store to hold your data. Solid APIs and Web Components accelerate your application development. Open web standards ensure the clearest, most reliable path for development with the opportunity to leverage “third-party” standards-compliant tools and components. Use ours freely forever or deploy your own Rerum environment. If you need it, we built this for you.

Why Do it This Way?

Having committed to provide this service we identified a number of key principles that inform our design and implementation of Rerum:

Annotation is everywhere

Scholarly assertions, descriptive annotations, and even the measurements and notes often mistaken for metadata are all possible vertices for standards like Web Annotation. These sorts of declarations should always be atomized as much as possible in their storage, especially when they are simple to recombine for user applications.

Attribution is essential

Productive communication between researchers requires reliable attribution. When individual annotations (as above) are each attributed, collaboration becomes natural, and even complex concerns such as authorization are just acts of filtering.

Format is ephemeral

No one can predict the Next Great Format, but history can teach us about what obstacles to migration can be avoided. The simpler it is to abandon, the more likely a format will remain accessible.

Knowledge is controversy

Statements of fact are reference material until challenged. The interaction of experts and the collisions of theories, evidence, and educated conjecture or theses enlivens research. Recording both the state and the history of the conversation serves knowledge.

Research is decentralized

Experts in any mature field are not only distributed across institutions, but the most influential investigators may move around, as people do. The complex and enforced conventions around citation and Intellectual Property Rights make clear the critical nature of maintaining attribution and encouraging distribution.

Standards facilitate openness

The hope of web standards is a realm with the freedom to create software individually, the practicality of reusing or sharing software, and the interchangeability of available functionality between software. For these reasons, we as developers in the field strive to follow the standards emerging for the web and for data. For the challenges our field faces, we often combine RESTful API practices, CORS, Web Annotation, Web Components, IIIF, JSON, JSON-LD, and Linked Open Data standards together so that the APIs and applications we create are automatically applicable to other APIs and applications built under the same guidelines.

Why a Public Alpha?

Although we are moving ahead with several partners to make sure the Rerum design fulfills its promises, this is from us, not for us. Success happens when data is discoverable, accessible, and durable—no matter whose logo is in the footer. We want to give you a space to begin to engage with the service, to start to evaluate how it might prove a useful part of your project.

As an alpha, we should not be a dependency for a production application or exhibit. Our plan is to inspire those dormant ideas that have been idling for lack of resources to spark to a level at which they may seek funding or become otherwise inevitable. For more immediate projects, we have RERUM version 0, which is currently in use in several projects, but whose API and interactions do not meet the goals of clarity, flexibility, and time to launch that we have set for v1.

How do I start?

Feel free to read up on Rerum at rerum.io or get into the weeds with some of the blog posts below. If you already know you need in, register your application at devstore.rerum.io, pop your API key into your project (or clone ours), and start saving annotations, transcriptions, commentary, manifests, image collections, and more immediately.
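A hypothetical first request might look like the sketch below; the endpoint path and the way the key is attached are placeholders for whatever the current API documentation specifies, not the documented interface itself:

const MY_API_KEY = "the key issued at devstore.rerum.io"; // placeholder
const CREATE_URL = "https://store.rerum.io/v1/api/create"; // assumed path; check the docs
fetch(CREATE_URL, {
  method: "POST",
  headers: {
    "Content-Type": "application/json; charset=utf-8",
    "Authorization": "Bearer " + MY_API_KEY
  },
  body: JSON.stringify({
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": "my first annotation",
    "target": "https://example.org/some/canvas"
  })
}).then(response => response.json())
  .then(saved => console.log("stored at", saved["@id"]));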

The Standards Approach

As developers in the field, we want to follow the standards emerging for the web and for data. For the challenges our field faces, we often combine RESTful API practices, CORS, Web Annotation, Web Components, IIIF, JSON, JSON-LD, and Linked Open Data standards together so that APIs and applications we create are automatically applicable to other APIs and applications built under the same guidelines. The Walter J. Ong, S.J. Center for Digital Humanities at Saint Louis University is facing these challenges in particular as we develop RERUM.

When combining standards, questions begin to arise. Where should an HTTP(S) request following all these standards fail for the requester? Is it best to pass when possible with warnings, or to fail with an error whenever applicable? When considering users (including each other), what is the best experience? Do we also have to find a way to get intention from the user to know whether we should fail, or do we calculate intention from other factors of the request? Can we use intention to morph requests to prevent errors?

These questions hold weight because the intersection of standards isn’t always accommodating. When a user asks a standards-compliant API to save the object `{hello:world}`, the request could be compliant with every other mentioned standard but fail IIIF. Likewise, if a user provides a perfectly structured JSON object that feels IIIF but forgets to include the @context, then the request fails IIIF, Web Annotation, and JSON-LD. If the user forgets the `Content-Type` header with the request (even if the body is JSON), then the request fails all but one standard. In all cases, the request could be processed and the object could be saved, depending on which standards the API considers strict.
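To make that concrete, here are the three shapes sketched as plain objects (illustrative only), annotated with the standards each one trips over:

// valid JSON when sent with a JSON Content-Type, but nothing about it is IIIF
const bareObject = { hello: "world" };

// feels like IIIF, but without @context it also fails JSON-LD and Web Annotation
const almostIIIF = { "type": "Annotation", "body": "a note", "target": "https://example.org/canvas/1" };

// a perfectly good JSON-LD body that fails almost everything if the request
// omits the Content-Type: application/json header
const goodBodyBadHeaders = {
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "Annotation",
  "body": "a note",
  "target": "https://example.org/canvas/1"
};
// a strict API rejects all three; a lenient one could happily save every one of them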

For developers hitting these sticky points when coding, the choice made trickles out to every user. These types of choices dictate the user experience of open source and proprietary products alike. When the users are other developers, available software starts to trend towards certain bottlenecks. The hope of web standards is a realm with the freedom to create software individually, the practicality of reusing or sharing software, and the interchangeability of available functionality between software. We must avoid bottlenecks around these standards to realize this hope and come up with best practices for combining standards in public-facing software.

Auth + Attribution of Open Data

Open Data is supposed to be accessible without any constraints on availability. The idea of authentication around Open Data is an oxymoron, but in practice we have found great benefit in keeping track of who can claim ownership of an object and in how we can use ownership to put natural restrictions on the openness of data, making it a more comfortable realm for people.

The restrictions on the openness of data are only necessary to alleviate the human concern of “that’s mine, not yours”. Even though a piece of data is ‘Open’, if Person A was its creator, they do not want Person B to be able to overwrite or delete it, as this would destroy any evidence that Person A had any semblance of ownership of the idea or claim the object represented.

Person A and Person B can have radically different ideas that both contribute to the source being considered. The best benefit for the source is to aggregate both Person A’s and Person B’s ideas as evidence to itself, while the best benefit for the public is to be able to look at Person A’s and Person B’s ideas separately and consider them (just like Person B did with Person A to start this whole thing). The fair result for Person A and Person B is to ensure their idea will always be theirs and always exist as theirs, and that only they will be able to remove or change their idea as it exists for the public.

At its simplest description, data is a collection of bits that represent something human readable.  It has no existence or purpose beyond that and has no reason to care about its creator because just existing fulfills its purpose for any resource that may need it.  The human construct of individuality has led to a culture of ownership.  This has created a unique responsibility for Data to be more sympathetic towards the human need for ownership, especially for data that represents an Idea from a Person about a specific Source or Sources.

Persons A-Z expect an idea to exist both as {their} idea and part of the public evidence to a fact or physical reality as a whole.  When an idea is released to the public, anyone should be able to know about it so they can use the evidence captured by the Idea to help support a claim around the same source.  If it is {your} public idea, the public should know it is {yours} and not be able to change that or claim it as {their} own.  Public data, open source, public record are all subject to this anomaly.  Although public, it is somehow attributed with restrictions to claims people can make on an idea or statement that was not {theirs}.

It is not under Open Data’s purview to authenticate, but it can authorize based on attribution and the idea of ownership. This supports the very human culture of “that’s mine” while also supporting the golden idea of “but I want others to use it freely to help advance themselves” and the human addition of “so long as they continue to credit that it’s mine”. As long as the Data knows WHO made it, authentication can rest on the software that uses it, and the Data itself can remain free and public while at the same time implementing ownership rights. Ultimately, people want this type of Data to exist like human Ideas, but computers are bad at handling abstraction and require solid logic paths underneath that end with yes-or-no answers.

With that need, we are attempting to handle attribution abstraction with solid logic structures like history trees and versioning trees and implementations of “working data” versus “released data”. The goal is to find a best practice to make this data more human friendly. Linked Data holds relationships between nodes of Open Data so that Data nodes can be connected to each other as they are naturally or forcibly related. Each Data node can hold metadata that describes WHO it came from. Data Trees can be used to hold Data in unique history order so the same Data can be referenced and reused, yet remain individually connected to WHO contributed to the evolution along the way. Supporting attribution means always knowing WHO is performing an action on an object, and when each object knows WHO performed the action on it, there can be a layer of authorization that confirms the WHO acting is the same WHO that created it.
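At its simplest, that layer of authorization is a single comparison. The sketch below borrows the generator property from the __rerum examples later on this blog, but the function itself is only illustrative:

function mayAlter(object, requestingAgent) {
  // the WHO that performed the original action travels with the object itself
  const creator = object.__rerum && object.__rerum.generator;
  // authorization collapses to a yes-or-no answer: is the acting WHO the creating WHO?
  return creator === requestingAgent;
}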

Authentication and Attribution in RERUM

Any new web service or application must take a considered look at authorization, authentication, and attribution—authorization, to make changes to data; authentication, to ensure those making changes are known; and attribution, to apply proper credit for contributions. The prevailing practice is to authenticate users within applications and use the appropriate context to make attributions. Popular transcription software, like TPEN and FromThePage, relies on user accounts and a private database to authenticate, attaching attribution based on the user’s account information in the interface and whenever the data is exported, for example, as an IIIF manifest document. Our goal to make RERUM a potent supplement to the heavier data APIs these types of interfaces rely on forced us to reevaluate the “obvious” choice to create and authenticate users.

Data Services and the User Problem

For Example

Annotation services like Hypothes.is and Genius follow the user authentication model and work well as plug-ins to other applications, but are never truly integrated. They are much more standards-based and reliable than Facebook comments or modules like Disqus, but still require a user account to authenticate. The problem with user accounts is that they expire even faster than users themselves. As an example, let’s consider how analog records are made more durable. At https://wellcomelibrary.org/iiif/b22282804/manifest, there is a document representing a book published in 1862. The artifact itself claims J.C. Clendon, M.R.C.S. is the author, but it does not lean only on that assertion. Immediately following the authorship statement is a qualifier connecting Clendon to Westminster Hospital and the Hospital School. As further authentication, the title page also offers the Greenwich Medical Society as witnesses to its reading. Finally, to lend trust to the assertion of the credentials presented, a publisher is also presented.

Analog attribution and authentication

Knowing

In the arc of the history of scholarship, it may be difficult to check the specific credentials of Clendon, or even to establish his uniqueness. However, the historical footprint and trustworthiness of the affiliated institutions may cast a longer shadow. Certainly, the combination of elements creates a useful attribution. As a data service, then, RERUM requires only an application to authenticate itself, allowing it then to assert whatever it may about its own content. In this way, the __rerum property is a trustworthy record of the application asserting the content. If the transcription annotation asserting authorship were saved by Wellcome into RERUM, it would be sent in as:

{    
    @id: "https://wellcomelibrary.org/iiif/b22282804/annos/contentAsText/a0t5",
    @type: "oa:Annotation",
    motivation: "sc:painting",
    resource: {
        @type: "cnt:ContentAsText",
        format: "text/plain",
        chars: "J. C. CLENDON, M.R.C.S.,"
    },
    on: "https://wellcomelibrary.org/iiif/b22282804/canvas/c0#xywh=383,1641,931,65"
}

As an external resource (details), this update action will create a derivative of the (coincidentally non-dereferenceable) annotation and assign a new URI:

{
    @id: "http://store.rerum.io/rerumserver/id/198ab3910ea9d",
    @type: "oa:Annotation",
    motivation: "sc:painting",
    resource: {
        @type: "cnt:ContentAsText",
        format: "text/plain",
        chars: "J. C. CLENDON, M.R.C.S.,"
    },
    on: "https://wellcomelibrary.org/iiif/b22282804/canvas/c0#xywh=383,1641,931,65",
    __rerum: {
        // URI to foaf:Agent for Wellcome
        generator: "http://store.rerum.io/rerumserver/id/28ab39ab5ea8f",
    ... }
}

Among other properties in __rerum, the generator property points to the Agent that was created for the application when it obtained its API key. Now there is a dereferenceable location available, and anyone consuming the annotation can trust Rerum to authenticate the Wellcome Library as the generator and that the content is from them. If any other application makes additional modifications, those versions also identify the generator, allowing for branches or filtering as needed by consuming interfaces.

Resolution

In this way, no user is ever authenticated by Rerum and attribution is pushed only as far as is possible with certainty and reasonable durability. In this case, Wellcome makes no claim within the annotation itself as to its attribution and even looking up the manifest only offers the backing of the Library (which may be enough for many cases). Another example is from TPEN: http://t-pen.org/TPEN/line/101083800. This line of transcription looks very similar to the Wellcome example, but includes _tpen_creator, an undefined, but consistent property. Someone familiar with TPEN’s API could use the “637” value with the /geti?uid= service to discover public user information. The flexibility of the authentication system in Rerum allows, of course, for the well-attributed document that takes full advantage of RDF and the Linked Data graph to connect to meaningful and dereferenceable IRIs, but it also degrades without breaking as the external links become harder to resolve or understand.

Deleted Objects in RERUM

In the last post, we explored how the tree of the version history is healed around a deleted object. In this post, we look more directly at the transformations to the deleted object itself. Let’s take the same abbreviated object to begin:

{ "@id"     : "http://store.rerum.io/rerumserver/id/01",
  "body"    : "content",
  "__rerum" : {
       "generator" : "Application",
       "isReleased": "",
       "history"      : {
           "prime"    : "root",
           "previous" : "",
           "next"     : [ "http://store.rerum.io/rerumserver/id/02" ]
       }
  }
}

The Case for Breadcrumbs

Because we are removing it from the versioning, the original tree structure and its position are irrelevant. However, it must be obvious to consumers that something has changed and that their version of the object is no longer intended to be valid, while offering a way out. Two often-used solutions that fail here are removal and flagging. If the object were no longer available at its original @id, interfaces (for example, a reference to a transcription annotation in an online article) would simply break, and developers would have no way of recovering the data or seeking an alternative. If a property like deleted:true were added, it would change the object so little that the interface would continue to render out-of-date information unless the developer had the foresight to specifically check for this flag.

To break the response without stranding the client, we need to disrupt the expectations of the interface without discarding the object data. Our solution is to move the entire object down one level into a __deleted property, breaking unaware interfaces, but still allowing developers to access the original object. Consider the deleted version of our sample:

{ "@id"       : "http://store.rerum.io/rerumserver/id/01", //same ID
  "__deleted" : {
       "time"   : "1515183834857", //time of deletion
       "object" : {                //snapshot of object at deletion
           "@id" : "http://store.rerum.io/rerumserver/id/01",
           "body" : "content",
           "__rerum" : {
             "generator" : "Application",
             "isReleased": "", 
             "history" : {
               "prime" : "root",
               "previous" : "",
               "next" : [ "http://store.rerum.io/rerumserver/id/02" ]
             }
           } 
       }
  }
}

In a typical interface from a consuming application, the rendered display would resolve the @id and bind to response.body. This is now undefined, and an aware interface (or a vocal audience) will quickly alert the developer of the change. Even a human who was ignorant of the RERUM API should be able to grok that the “new location” of their data is __deleted.object.body and the real-life meaning of that move should feel like plain English. At this point, the developer can choose to update the display to render the deleted object (hopefully with some notice) or replace it with an adjacent version referred to in the history property.
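A consuming interface only needs a small guard to take advantage of this. In the sketch below, resolveVersion() stands in for whatever the developer uses to fetch an adjacent version; it is not part of any API:

async function renderAnnotation(uri) {
  const obj = await fetch(uri).then(response => response.json());
  if (obj.__deleted) {
    const snapshot = obj.__deleted.object; // the original data is still here
    // fall back to a child version if one exists, otherwise the parent
    const fallback = snapshot.__rerum.history.next[0] || snapshot.__rerum.history.previous;
    return resolveVersion(fallback); // hypothetical helper
  }
  return obj.body; // the happy path binds exactly as before
}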

Results

In this system, there is no way, through the API, to actually obliterate an object or any version of it. This is important because of the interrelationships between all the versions and the risk built into true deletions being committed pell-mell. The MongoDB behind our current version of RERUM still supports complete removal, if it becomes required, but we believe the presented solution will break interfaces appropriately without marooning them.

So what do you think? Does our solution stray too far from convention? Are we inviting disaster trying to enforce standards while messing with objects? Let us know in the comments or submit an issue on Github.

Forgetting Deleted Objects in RERUM

At the Walter J. Ong, S.J. Center for Digital Humanities, we have been working hard on RERUM, the public object repository for IIIF, Web Annotation, and other JSON documents. The latest feature we’ve been diving into for the 1.0 release is DELETE. As is covered in the documentation on Github, there are a few guiding principles that are relevant here:

  1. As RESTful as is reasonable—accept and respond to a broad range of requests without losing the map;
  2. As compliant as is practical—take advantage of standards and harmonize conflicts;
  3. Trust the application, not the user—avoid multiple login and authentication requirements and honor open data attributions;
  4. Open and Free—expose all contributions immediately without charge to write or read;
  5. Attributed and Versioned—always include asserted ownership and transaction metadata so consumers can evaluate trustworthiness and relevance.

Considering

Setting aside for now some of the nuance of the common REST practices and the Web Annotation Standard, we needed to determine what type of request we would accept, what transformation would happen to the data, and how the service would respond to the client application. Authentication (covered in depth in another post) is the first concern, since an Open and Free repository is a target for vandalism. Honestly communicating the “deleted” status of an object without breaking all references to it is served by appropriate Version Control (see here). Combining these concerns, we realized it was not enough to “just remove” or “delete-flag” an object (specifically, a version of an object)—our repository has to heal the version history tree around it.

Let’s start with an abbreviated object (prime shown):

{ "@id"     : "http://store.rerum.io/rerumserver/id/01",
  "body"    : "content",
  "__rerum" : {
      "generator" : "Application",
      "isReleased": "",
      "history"   : {
          "prime"    : "root",
          "previous" : "",
          "next"     : [ "http://store.rerum.io/rerumserver/id/02" ]
      }
  }
}

Whose version history tree (in the simplest complex form) looks like this:

A single object in 9 versions, updated by 3 different applications (A, B, C)

Before attempting any deletion, there are 3 checks that must be passed:

  1. Is the request coming from the original generator?
    In an open system, we offer endless opportunities to extend the work of others, but deleting is destructive, so it may only be done by the application that originally created this version.
  2. Is this version Released?
    RERUM offers few promises to client applications, but the immutability of Released objects is one of them.
  3. Is this version already deleted?
    While it may not have a huge effect to “re-delete” a version of an object, it feels dishonest to the client to do so.

Deleting

There are three possible positions a version may be in (in ascending complexity):

  1. Leaf without descendants: 05, 08, 09 above;
  2. Internal with a parent and at least one child, 02 and 07 being the most complex; and
  3. Root, as the __rerum.history.prime value of “root” indicates in node 01.

All these options follow this process:

In the first two cases, where the deleted version is not “root”, the healing is fairly straightforward. The chain is simply pinched around the version, removing the object—those listed in the h.next array are updated (in h.previous) to point to the version from which the deleted version was derived. The h.next array in that version is updated to bypass the deleted object. In the case of deleting 02, above, 03 and 06 would now claim 01 as a parent and 01 would now have them as children.
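In pseudocode terms the pinch is only a few pointer updates; getVersion() and saveVersion() below are placeholders for the database layer, not the RERUM implementation:

async function pinchOut(deleted) {
  const parentId = deleted.__rerum.history.previous;
  const parent = await getVersion(parentId);
  // children of the deleted version now descend directly from its parent
  for (const childId of deleted.__rerum.history.next) {
    const child = await getVersion(childId);
    child.__rerum.history.previous = parentId;
    await saveVersion(child);
  }
  // and the parent's next array bypasses the deleted version entirely
  parent.__rerum.history.next = parent.__rerum.history.next
    .filter(id => id !== deleted["@id"])
    .concat(deleted.__rerum.history.next);
  await saveVersion(parent);
}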

If, however, the deleted object is the original instance, the troubling reality is that every version refers to it as the __rerum.history.prime value. Moreover, by deleting it, the client has made an adjustment to reality, removing the timestamped “first moment” of the history. Trusting the client application and holding to our principles of freedom, we must allow this and document it well. For each child of 01, the original tree is broken into a new tree with that node as its root. This feels destructive, but reflects the intent of a client requesting such an action. Within that new tree (02, here), h.prime becomes “root” and the @id of the new root must be populated to all the descendants. The new prime also enters an unusual state (explored more here) where h.prime is “root” and there is a value for h.previous. By referring to the deleted object, it remains possible for an enterprising researcher to reconstruct the historical relationships in this tree, though no query will expose it simply.

Nuking

What of the case where I want to obliterate an entire object, including all versions ever generated? Though a simple case from the perspective of user intention, it becomes logically complex. We have not supplied a shortcut method for this yet, but the logic of the process follows the chart above in iteration. Let’s take the applications A, B, C above and attempt to blow away this unwanted object.

Application A

This application seems to have the most right to commit this act, as the owner of the “prime” instance. However, there are intervening conditions here. Whether you burn the tree from leaves to root or root to leaves, the result is the same. Removing 01 simply updates the tree to have 02 as prime and all nodes are updated. Removing 02 does the same with the side effect of creating a tree from 03(A) and 06(B). Continuing to remove the 03 tree is easy, but an attempt to remove the 06 tree fails immediately, as “A” no longer “owns” it. This is appropriate, since “A” was allowed to purge the history of its contributions, but not those derived from them in the meanwhile. Clients using 08 and 09 will now only see the h.prime of 06, though they may traverse the records back to the original prime if they wish.

Application B

This application has the reverse problem from “A” as a derivative creator. No attempt to delete anything above 06 will be possible, full stop. Blowing away the 06 branch, however, will meet with similar constraints as “A,” since “C” has established a derivative branch on 07. As 07 is deleted, those versions in the h.next array will be updated to point to 02 (the result of deleting 06). Once all allowed actions are finished, 09 will have effectively replaced 06 in the 01 tree. No active node within the tree will retain a reference to “B” versions, though (see below) the deleted objects will report their placement.

Application C

This application is the most limited, as it has only contributed one version, a leaf, to the tree. While it is unable to impact the major structures, it is also able (like “B”) to remove itself from the history completely, leaving the only traces of its original participation in the objects themselves.

The Debris

What is left after a successful deletion in the history tree is important to clients who continue to depend on the leaves, but what of those consumers who had referenced the deleted object? Following our principles, we intend to make it obvious to consumers that something has changed and that their version of the object is no longer intended to be valid, while offering a way out.

We think our solution is quite clever, but that’s a post for another day.

Versioning in RERUM

Versioning as it is known in software is simply the process of preserving previous iterations of a document when changes are made. There are many systems available to the developer which differ in centralization, cloning behaviors, delta encoding, etc., but for our purposes, the philosophy and utility should suffice.

From a mile up, versioning resembles different editions of a published work. Even when the new work supersedes the previous, both are maintained so that they can be cited reliably and the changes between them may become an interesting piece of the academic conversation as well. In the digital world, new versions can come quickly and the differences between them are often quite small. However, in an ecosystem like Linked Open Data, that presumes most resources are remote and have a permanent and reliable reference, even small gaps can magnify and create dysfunction.

Permanent Reference

The idea of a PURL for resources on the Internet is an old one. Most HTML pages for records in a collection include a “reference URL” or “cite as” which indicates to the user how to reliably return to the object being viewed. With RERUM, we want to extend the courtesy of a stable home to sc:Manifest, oa:Annotation, and similar objects that are more often viewed only as attached to resources like books, paintings, and other records of “real things.”

Upon creation in RERUM, an object (henceforth an annotation, for ease of example) is minted a novel URI (@id, in JSON-LD) where it will always be available. Whenever a trusted application makes a change, a new version is saved and given a new URI, connected to the previous annotation. In most cases, the significant changes will be to the oa:hasBody or oa:hasTarget fields, but any altered property may make an existing reference unreliable, so the new URI is returned to the application while the old reference remains stable.

Branches: the family tree of digital objects

In the wide world, it is likely that two or more applications may be referencing and even updating the same annotation for different reasons. As an example, let’s consider transcription annotations made in a standards-compliant application, such as TPEN, and Mirador, an open viewer for IIIF objects. A museum may digitize a manuscript and open it up for public transcription, creating many varied annotations throughout the document, identifying image fragments and text content. In RERUM, these annotations are not only associated with each other in the Manifest shared by the museum, but every individual annotation attributes itself to the application (if not the user) that created it or offered updates. This case for versioning is simple to see and accommodate.

Now a prestigious paleographer comes along wishing to complete an authoritative edition of this manuscript. She will begin a project in the transcription application using all of the annotations that existed at the time. As she goes through and applies corrections and conventions for expansions and spelling and punctuation, she is updating the annotations, but in her own project, not that of the public project from the museum. This works because the museum project’s annotations still exist and the references to them have not changed. If someone else suggests a change to an annotation for which the paleographer has already created a more recent version, the history of the annotation branches, allowing both current versions to share ancestry without collision.

Timeline Breakdown

1. Created four lines on a page in the museum project:

{A1:Museum1}, {A2:Museum1}, {A3:Museum2}, {A4:Museum1}

2. The paleographer begins work and makes a change on the second line:

{A1:Museum1}, {A2.1:Paleographer1}, {A3:Museum2}, {A4:Museum1}

but the museum project remains unchanged.

3. When Museum2 corrects an error in Museum1’s transcription (coincidentally the same error the paleographer caught):

{A1:Museum1}, {A2.2:Museum2}, {A3:Museum2}, {A4:Museum1}

the public project is updated without breaking the paleographer’s edition.

So the history of A2 looks like this:
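Reconstructed from the timeline, the branch is roughly:

A2:Museum1
  ├─ A2.1:Paleographer1 (the paleographer's edition)
  └─ A2.2:Museum2 (the public museum project)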

Now the model works for everyone. The museum has an open project, the paleographer has a project she controls, and every contribution through history is automatic. The problem that has been introduced, however, is how should a Manifest viewer display these annotations? To show them all in all their versions would quickly overwhelm, but depending on the moment the “most recent” annotation may be from either branch. Moreover, the paleographer may not intend for these annotations, though stored in an open repository, to be considered a finished project, suitable for display or citing. The simplest answer is that a repository can publish their Manifest with all the intended annotations included, but that undermines the possibilities of an open store.

Digital “Micropublishing”

RERUM supports and encourages the use of a few oa:Motivations that have not been used before. In addition to the Motivation on each well-formed annotation that should indicate whether it is a transcription, translation, comment, edit, etc., applications may apply rr:working or rr:releasing. Not having either of these Motivations means nothing—it is simply unhelpful, and is the default of current encoding platforms. The thinking up to this point has been that annotations would always be sequestered until they are ready for publication and anything found in the wild is official. Linked Open Data, however, thrives on connections and the interesting things that can happen when many iotas of half-completed assertions and descriptions hit critical mass. As a platform, RERUM cannot hide anyone’s work and sincerely commit to open data and interoperable exchange, but it would be foolish to release users’ assertions into the wild knowing that convention assumes these objects have been imbued with a complete academic soul.

The rr:working Motivation is simple. The statement is just that the content of the annotation is not intended for publication—it is shared in good faith for use in discovery or as a seed for further work only. An rr:working object is subject to change and may be completely reversed, disowned, or deleted in the future. It is just a conversation over coffee, albeit one that establishes history and maintains full attribution in its encoding.

The rr:releasing Motivation is not the inverse of rr:working, though they are semantically exclusive. It is a full mark of confidence from the creator and assurance from RERUM that it will be available at this location in perpetuity. When an application marks something as published, no changes may be made to it—full stop.
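Such an annotation might carry the extra Motivation alongside its primary one, as in this illustrative sketch (the serialization here is not prescribed by RERUM):

{
    "@type": "oa:Annotation",
    "motivation": [ "sc:painting", "rr:working" ],
    "resource": {
        "@type": "cnt:ContentAsText",
        "format": "text/plain",
        "chars": "a provisional reading, shared in good faith"
    },
    "on": "https://example.org/iiif/canvas/c3#xywh=10,20,300,40"
}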

In this way, strange new combinations of publications may arise. A museum may “publish” their Manifest, offering their imprimatur on the canvases, labels, and descriptions within. Then this Manifest may be offered up for crowd-sourced transcription without also implying that the museum stands behind the evolving transcription data. A scholar who is working their way through a difficult document can publish a chapter on which they would like to plant their flag while continuing to work on the rest of the document, years before publishing the critical edition of the work. It should be noted that RERUM does not require these Motivations to be included in the main body of the object if the schema is being controlled; properties in the __rerum property covered in the API can be used to raise the same effect.

Postscript: overwrite

This final option for applications is unique to digital objects and is to be used very sparingly. By throwing the ?overwrite switch when making an update request, the application acknowledges that the existing version of the object is flawed or so irrelevant as to be useless. In this case only, there is no new version made; the original is updated and the private __rerum metadata is updated with the date of the overwrite. In most cases where this is required, best practices should have handled it before committing the change, but RERUM acknowledges that human error will sometimes undermine machine obedience.
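A sketch of throwing that switch; the endpoint constant and the HTTP method here are placeholders for whatever the API documentation specifies, not documented behavior:

const correctedObject = { /* the corrected object, keeping its existing @id */ };
fetch(UPDATE_URL + "?overwrite=true", { // UPDATE_URL is a placeholder, not a documented path
  method: "PUT", // method assumed for illustration
  headers: { "Content-Type": "application/json; charset=utf-8" },
  body: JSON.stringify(correctedObject) // same @id; no new version is minted
});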

Editing Remote Objects in RERUM


One use case that has recently captured our imagination in the Center is that posed by the updating of otherwise inaccessible objects. For example, if a user found a transcription annotation at the Wellcome Library which they wanted to update, but there was no accessible annotation service mentioned, that user may find help in a Rerum-connected application. Without specific accommodations, it is likely that we have just encountered another “cut and paste” disjunction when the annotation is created anew in a system without any organized reference to its provenance.

Trust and Authentication

Trust is a complex thing on the Internet. Applications generating annotations (or any other object) are free to assert an appropriate seeAlso or derivedFrom relationship, but it is as easy to mistake as it is to omit. Further, it would only be needed on a newly created RERUM object, meaning there is no automatic way to detect if the relationship should be expected on a new POST.

Within our ecosystem, there is the __rerum property, which is reliably added to objects and is a Source of Truth for object metadata, such as creation date, versioning, and the generating application. This, then, is where the connection to a remote version of an object must be made, if it is to be trusted by any consumer. Though the internal assertions in each object may carry much more detail and meaning, the repository must have a way to simply reference the origin of an “external” version.

Welcoming Guests

Versioning in Rerum relies on the __rerum.history property and the relationships it identifies (@context):

  • prime is assigned to the first instance of any new object as the root of the version tree—every new version refers to the IRI of this first version;
  • previous is the single version from which this version was derived; and
  • next is an array of any versions for which this is previous.

As all versioning is only maintained through reference, it is critically important that our server accurately process all manipulations to the database to preserve the integrity of the trees. In normal cases, when all versions of an object exist within the same database, there are only three simple configurations (excluding deletions):

  1. Version 1 is the only version generated with the “create” action. It has a value of “root” for history.prime and no value for history.previous.
  2. Internal Nodes always have a value for history.previous and history.next with the same IRI in history.prime. Only an “update” action generates these.
  3. Leaf Nodes are the “latest” versions and resemble internal nodes with an empty history.next array.

Though a user may consider it obvious that this “update” is connected to the annotation collections and the manifests adjacent to it, without encoding that relationship as a previous version, no automation or digital citation is possible. In the semantic history of the RERUM version of a remote resource, a combination of cases 1 and 2 occurs—the Rerum version is clearly derived from another resource, but no record of it exists among the objects improved with the __rerum property.

Updating an Idea

It is the JSON-LD standard that presents the solution and directs us towards the useful fourth configuration. The @id property, which is usually assigned during a create action, is likely to be present on an external JSON-LD resource and must be on any request to the update actions. Our example above, https://wellcomelibrary.org/iiif/b2228system.nos/contentAsText/a18t0, is demonstrative of the flexibility of this system. This IRI is not dereferenceable, meaning that it is only useful within an ecosystem of resources, such as an sc:AnnotationList and sc:Manifest, and the entire “previous version” of what is saved in RERUM cannot be known (for considering changes, attribution, etc.) with only the IRI available. Because it is not required to be dereferenceable, it also is not required to exist digitally—a reference to an undigitized resource or an ad hoc IRI for a concept may be used if the case calls for it. This is especially useful for fragments of monolithic resources or difficult-to-parse resources, such as those targeted by the tag algorithm. Even if the updating application is not aware of an authoritative id for the object it is submitting, one could be generated and attached to invoke this “update” protocol.

In support of this, the Rerum “create” action will reject any object submitted with an @id property, requiring an “update” instead. Then, in the case that the resource being updated does not reside within the database, our new hybrid configuration initiates the version tree. Though its prime is “root,” this version of the object also references the incoming IRI as the previous version of itself, maintaining the connection (as much as is possible) to the external resource, even if no reference is made within the object itself and no descriptive annotation is attached after the fact. The same solution is applied to references to a “root” version that has been deleted after the version tree was established. Deletion takes the version out of the tree, effectively making it a remote resource for the new prime version.
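A rough sketch of that hybrid configuration, with illustrative identifiers: the version is its own prime, yet still points to the external IRI as its previous.

{ "@id"     : "http://store.rerum.io/rerumserver/id/03",
  "body"    : "content derived from a remote resource",
  "__rerum" : {
      "generator" : "Application",
      "history"   : {
          "prime"    : "root",
          "previous" : "https://example.org/external/annotation/a1",
          "next"     : [ ]
      }
  }
}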

 
