Ong Blog

Exploiting Client-side Line Detection

This continues a previous post that introduces the minor piece of code we’ll be exploring below.

Hello, Old Friend

Recently, two events coincided that inspired me to pull this code back out and take a second look at the process. The first was that our center charged a group of Computer Science majors with improving the effectiveness of our image processing algorithm as part of their senior capstone project; the second was the seemingly sudden burst of HTR promises, which depend on some level of layout analysis to work. In both cases, I was struck that improvements came from more processing power and additional layers of analysis. Although more of the outlier cases were falling into scope and well-formed documents were becoming more automatable, the simple cases were moving from realtime (<8 seconds) into delays of minutes or, in some cases, hours before interaction with the results became possible. I do not want to diminish the scale of these accomplishments or sound like someone who gripes about waiting almost two hours to board an airplane that will take me to the other side of the world in half a day. However, there are certainly use cases at the lower end of the complexity spectrum that may not require, and cannot benefit from, the horsepower being built into these new models.

I honestly don’t know where this sample image came from (aside from the British Library), but it was in my cache when I lost WiFi years ago. It was time to feed this to the machine and see what happened. In short order, I wrote up a function to visualize the sums of the rows and columns to see if the text box seemed to be obvious. The result felt conclusive:

Setting a default threshold of 10% of the busiest row (marked in black beneath the image), the possible columns popped out as expected. I was also pleased to see that candidate rows appear without too much imagination. Obviously, there are some spots, such as the gutter and the page edges, that do not represent a text area, but by simply constraining the width of the analysis and expecting the sawtooth of rows, I not only eliminated irrelevant “columns” but was able to detect separation within a column. I can easily imagine bracket glosses, round text paths, or heavily decorated text that would break this, but those are not my target. With no optimization and the inclusion of several heavy visualizations, I was able to render decent candidate annotations for column and line detection in about two seconds. At the lowest resolution, this time was under one-fifth of a second.
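The row-and-column sums behind that visualization can be sketched in a few lines. This is a minimal illustration, not the actual code from the repository; the function and parameter names are my own, and it assumes the image has already been reduced to a flat array of per-pixel “busyness” values.

```javascript
// Sketch of the projection-profile idea: sum busyness per row and per
// column, then mark anything above 10% of the busiest row/column as a
// candidate text area. All names here are illustrative.
function projectionProfiles(busyness, width, height, threshold = 0.1) {
  const rowSums = new Array(height).fill(0);
  const colSums = new Array(width).fill(0);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const v = busyness[y * width + x];
      rowSums[y] += v;
      colSums[x] += v;
    }
  }
  // Threshold against the busiest row/column, as described above.
  const rowCut = Math.max(...rowSums) * threshold;
  const colCut = Math.max(...colSums) * threshold;
  return {
    rowSums,
    colSums,
    textRows: rowSums.map(s => s >= rowCut),
    textCols: colSums.map(s => s >= colCut),
  };
}
```

The “sawtooth of rows” is then just the runs of `true` in `textRows` within a detected column.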

Things take a turn

Rather than declare victory, I investigated the minor errors that I was able to find. While I expected and accepted pulled-out capitals and paragraph numbers as well as the mischaracterization of a header graphic as text, it bothered me that one pair of lines was joined, though the visualization suggested their separation. I could fiddle with the thresholds to get a better result, but that also thinned the other lines beyond what made sense to me, so it was not a solution. Stepping through the numbers, it seemed that the slight rotation magnified the impact of the ascenders, descenders, and diacritics that busied up the interlinear spaces. It would not be unreasonable for this lightweight tool to require pre-processed images with good alignment, but some simple poking told me the amount that this image was “off” by was just around -0.75 degrees, which feels close enough for most humans to consider this a good photo. Instead, I began to imagine the shadow cast by a rotating text box and experimented with rotations that made the column curve more round or severe.

They were mathematically different, but determining the best fit was becoming more and more complex, which undermined the original purpose. A simple check for the rotation that produced the narrowest column was possible, and the narrowest column did seem to coincide with the best rotation, but automating that step was difficult on multiple columns, and it was too easy to miss the best rotation if the interval was set too high. I looked at column widths, row counts, and the difference between max and min values for a host of metrics, but nothing reliably predicted the correct rotation.

Always Assume

After carefully recording and comparing characteristics of good and bad fits across several images, I discovered an assumption about manuscripts that I was not yet leveraging: rows are regular. Even with variety, most ruled manuscripts will be dominated by rows of similar heights. I updated the function to select the best rotation based on the minimum standard deviation from the mean row height. This calculation is lightweight for the browser, and the rows are already calculated at each step of determining column boundaries, so there was minimal overhead. As a default, I evaluate each degree from -3 to 3 and then rerun around the lowest value with half the interval until the interval is under one-eighth of a degree. Without caching checks or eliminating intermediate renders, this process takes longer, but it regularly finds the best rotation for a variety of documents. On my machine, it takes about 1 millisecond per pixel processed (40 seconds with the sample image), but the back of my envelope records 922 of these tests as redundant, which means a simple caching optimization will put this process under twenty seconds. Using this method, an incredibly problematic folio (microfilm, distorted page, skewed photography, tight lines) is not only rotated well, but is evaluated with incredible precision.
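The coarse-to-fine search described above can be sketched roughly like this. The row detection itself is elided: `score` stands in for “detect rows at this rotation and return the standard deviation of their heights,” and all names and defaults here are illustrative rather than taken from the repository.

```javascript
// Standard deviation of row heights: the regularity metric.
function stdDev(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  return Math.sqrt(values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length);
}

// Coarse-to-fine rotation search: evaluate each degree from -3 to 3,
// then rerun around the lowest value with half the interval until the
// interval drops below one-eighth of a degree.
function bestRotation(score, lo = -3, hi = 3, minStep = 0.125) {
  let step = 1;
  let best = lo;
  while (step >= minStep) {
    let bestScore = Infinity;
    for (let a = lo; a <= hi + 1e-9; a += step) {
      const s = score(a);
      if (s < bestScore) { bestScore = s; best = a; }
    }
    // Recenter on the current best and halve the interval.
    lo = best - step;
    hi = best + step;
    step /= 2;
  }
  return best;
}
```

A caching layer keyed on the angle would skip the redundant evaluations mentioned above, since the recentered passes revisit angles the coarse pass already scored.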

Robert Grosseteste, Eton College Library 8

Full page, rotated 1.375 degrees, 52 rows in 2 columns

Next Steps

This is not remarkable because it is possible, but because it is mathematically simple and reasonable to accomplish on a client. This not only means the transcription (or generic annotation) application does not need to sideload the image to process it, but also that any image can be offered, even one from the local machine or one that doesn’t use helpful standards like IIIF. One can imagine this analysis being available for any image within the browser through a bookmarklet or extension. Once analyzed, these annotations could be sent to a service like Rerum, saved into LocalStorage for later recall, or sent directly into a transcription tool like TPEN.

Within an application, this tool can be even more powerfully used. Without requiring a complex API to save settings, a user may tweak the parameters to serve their specific document and reuse those settings on each page as the interface renders it. Even if the line detection is inaccurate or unused, the column identification may be helpful to close crop an image for translation, close study, or to set a default viewbox for an undescribed page.

This is not part of any active project and just represents a couple days spent flogging an old idea. The whole thing, such as it is, has a GitHub repository, but isn’t going to see action until there is a relevant use case. What do you think? Is this worth a few more cycles? How would you use a tool like this, if you could do lightweight image analysis just in your browser or offline?

Experimenting with Client-side Line Detection

Does not compute

Using an “old” iPad on a plane to review transcription data was a clarifying task. For all the advances in research technologies, even simple tasks, such as viewing manuscript images on an institution’s website, can crash a five-year-old browser, effectively rendering them inaccessible. I am not willing to accept that the very tools and scripts we have been building to make these resources more interactive and discoverable are also rendering them inaccessible on aging (but still functioning) hardware. There is a place for discussing progressive enhancement design, progressive web applications, and emerging mesh-style protocols like IPFS, but I’m going to be very targeted in this post. The choke point of manuscript image analysis has always been the server-side task of layout analysis (as in our TPEN application), which has been making great advances with the addition of machine learning in computing clusters (Transkribus and others are in the spotlight at the moment). I am calling for an algorithm simple enough to run in the browser of an underpowered machine that can accomplish some simple tasks on “decent” photography.

WiFi not available

Imagine you are in a magic tin can that zips through the air at high speeds and connects you simultaneously to all the world’s knowledge. From these heights you work away, paging through images of a medieval manuscript and transcribing it into a digital language that encodes it for limitless reuse. You are working there not because it is the best at image analysis, but because its servers run an algorithm good enough for your clean document and do so in real time, returning line detection on each page in mere seconds—at least it used to. As the Internet connection gets spottier, the responses become slower. You wait eight seconds… thirty seconds… and then silence. Your mind reels trying to recall that YouTube video you watched on EM waves and to resist blaming this outage on a vengeful god. A full minute without WiFi passes and you realize there is a chemical bomb on your lap that cannot even entertain you. It would have been more reliable to carry a pencil and a sheet from a mimeograph with you than this unusual pile of heavy metals, polymers, and pressure-cooked sand. What else about your life have you failed to question? Do you even really grasp the difference between air speed and ground speed? How planes!?

Dash blocks and breathe

I was unable to answer all these questions for myself, but I did start to wonder about what minimum effective image analysis might look like. Existing algorithms with which I was familiar used very generic assumptions when looking for lines. The truth is that manuscripts can be quite diverse in form, but photographs of them taken for transcription strongly tend towards some similarities. For this experiment, I am dealing with manuscripts where the text is laid out in rectangular blocks and takes up at least a quarter of the image. I wanted to find something that could deal with the dark mattes, color bars, rulers, and other calibration paraphernalia. Ideally, it would be able to find text boxes and the lines within, even if the original image was slightly askew or distorted. Algorithms that looked only for dark areas were confused by mattes and often rated a red block as equivalent to a column of text. Strictly thresholding algorithms lost faded tan scripts on parchment easily. My solution would need to be good enough to run in a vanilla state and quick enough to calibrate for special cases if needed.

I did not look for dark spots, but for “busyness” in the page. While some scripts may have regions of strong linear consistency, most scripts (even character-based ones) are distinguishable by their contrast to the plainness of the support medium.

Sample image processed for “busyness”

I began, on that airplane ride, to write a simple fork of some canvas element JavaScript filters I had bookmarked a long time ago. Simply, I redrew the image in the browser as a representation of its busyness. What I dropped on Plunker when I landed took each pixel and rewrote it depending on the difference between itself and the adjacent pixels on the row. I was excited that with three very different samples, the resulting visualization clearly identified the text block and reduced the debris. By then the plane had landed and I put away my childish fears that technology would ever abandon me.
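The pixel rewrite can be sketched as a pure function over an ImageData-like object. This is an illustration of the idea rather than the code that landed on Plunker; the names are mine, and in a browser you would feed it the result of `ctx.getImageData()` and paint the output back with `ctx.putImageData()`.

```javascript
// Redraw the image as a representation of its "busyness": each pixel
// becomes the absolute brightness difference between itself and its
// right-hand neighbour on the same row. `img` mimics a canvas ImageData
// object ({ data, width, height }) with RGBA bytes.
function busyness(img) {
  const { data, width, height } = img;
  const out = new Uint8ClampedArray(data.length);
  const gray = (i) => (data[i] + data[i + 1] + data[i + 2]) / 3;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const i = (y * width + x) * 4;
      // Compare with the adjacent pixel on the row (0 at the row's edge).
      const diff = x < width - 1 ? Math.abs(gray(i) - gray(i + 4)) : 0;
      out[i] = out[i + 1] = out[i + 2] = diff;
      out[i + 3] = 255; // keep the result opaque
    }
  }
  return { data: out, width, height };
}
```

A flat tan page and a flat dark matte both score near zero here, which is why this survives mattes and color bars that defeat darkness-based detection.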

Finding Value

In the next post, I will discuss why I opened up this old pile of code again to see if I could teach it a few new tricks. I am curious, though: what snippets or small concepts do you have in a dusty digital drawer that might be useful? Use the comments here to advertise the GitHub repo you haven’t contributed to in years, but still haven’t deleted.

Page tools and getting more out of your images

Page tools – CSS3 to the rescue
The page tools split screen was developed to take advantage of CSS3 image manipulation tools. The reason is that while we allow access to 4000+ manuscripts, the vast majority of our traffic is private uploads. And those uploads are invariably poor-quality microfilm scans, photocopies of such scans, or even photos taken in poor lighting with handheld cameras and phones. While it is wonderful to be able to use the high-quality images of a repository, we all know that such images are not always an option because of availability, cost, or the resources of the institution.

While there is a wide variety of image manipulation tools available through CSS3 (although not all browsers support them – we’re looking at you, Safari), we quickly narrowed it down to four tools: invert, grayscale, brightness, and contrast. We had originally used these tools in our brokenbooks.org project in a custom Mirador IIIF viewer. (We were very excited that the Mirador group has taken that work into the Mirador viewer and even expanded it.)

To select from the image manipulation options, T-PEN established two very simple criteria: which tools would be the easiest for the user to understand, and which of those had the potential to most often give useful results. Most tools failed one of these two tests but passed the other. Take curves and channel mixing, for instance, which can give great results in Photoshop, but they require a familiarity with those kinds of approaches to image manipulation, which can take quite a learning curve to master. We could not assume that users would come in with an understanding of such things, nor should they have to.

These two simple principles gave us a short list very fast: invert, grayscale, brightness, and contrast. All four pass the first principle easily, but the second one is a little more challenging, because by themselves they have limited utility. However, these four tools work extremely well together. So much so that brightness and contrast have almost always been presented together as a combined tool in image manipulation apps, for instance. Invert is a surprise for many users. We forget the degree to which we are conditioned by the codex: white page with black script. Some of us who are old enough to remember green screens do so with fondness. White type on a black screen can be easier to read than black type on white; one must remember that the human eye deals with projected light a little differently than reflected light. This is especially true if invert is used in conjunction with grayscale and a little brightness/contrast.
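Since all four effects are plain CSS filter functions, combining them amounts to building one `filter` string for the image element. This is a minimal sketch with an invented helper name; the percentages mirror the CSS defaults (100% brightness or contrast means “unchanged”).

```javascript
// Build a CSS filter string from the four page-tool settings.
// Helper and parameter names are illustrative, not from T-PEN's code.
function pageToolsFilter({ invert = false, grayscale = false, brightness = 100, contrast = 100 } = {}) {
  const parts = [];
  if (invert) parts.push("invert(100%)");
  if (grayscale) parts.push("grayscale(100%)");
  if (brightness !== 100) parts.push(`brightness(${brightness}%)`);
  if (contrast !== 100) parts.push(`contrast(${contrast}%)`);
  return parts.join(" ");
}

// In the browser, the string is simply assigned to the element:
//   img.style.filter = pageToolsFilter({ invert: true, grayscale: true, brightness: 120 });
```

Because the browser composites the filters in order, invert plus grayscale plus a brightness nudge is a single cheap GPU pass rather than a server round trip.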

Giving it everything you got

If you are on a Mac, go look at the display options in the Accessibility panel of your System Preferences, because you’ll find invert, grayscale, and contrast looking back at you. These represent the simplest way to improve readability. They are not always the best, but not everyone has a Photoshop guru to help them out.

In the manuscript world, where text can often be faded brown on a tan background, this combination of tools may surprise you.



NEXT TIME : Page Tools – Editing and manipulating columns.

TPEN Updating the transcription interface. Part 2.

The last blog covered a little about the challenge we laid out for ourselves in reworking the T-Pen transcription interface. We set out to see if we could arrange and reorder the interface to be cleaner and easier to use, improve access to the hidden tools, privilege the most used tools, and be more consistent in how the tools function, without abandoning any tool. In the last blog we talked about what we did to support transcription directly. In this blog we will talk a little about how we arranged our tools around transcription: we set the various tools at different distances from the transcription function, both as a matter of physical layout and through different modes of interaction.

In the last blog we identified a variety of modes of interaction, such as split screens, pop-overs, redirects into management tools, or simple buttons for tool selection. While this list of modes gave us the greatest opportunity to simplify and refine the UX of the transcription tool, the immediate goal was not to reduce the number of ways the user could interact with the interface, but rather to ask what was being done with each approach and why. By doing this we were able to bring tools together as a matter of their form and function; more importantly, we were able to identify the distance each interaction put the user at from transcription, and use that as a way to give a hierarchy and order to the interface.

(un)Wrapping the Onion.
To organize our hierarchy we identified a series of layers (like an onion), established the level of focus required of the scholar vis-à-vis the performance of transcription, and assigned the modes of interaction to support that. The closer to the center a function lies, the less distracting and easier to use it should be. We ended up with transcription at the core, then Close Focus/Keyboard, Near Focus/Split Screen, and the outer layer of Distant Focus/Option Tab.


Transcription itself we didn’t change much, but we did add auto-detection of the character set, so that with RTL characters the text box will adjust its presentation to show them correctly. This is part of our efforts to broaden the functionality of T-Pen in the coming years in response to requests for such support. We also developed a beta RTL variant that can be activated via the Option Tab, but more on that in a later blog.
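One lightweight way to auto-detect an RTL character set, similar in spirit to HTML’s `dir="auto"`, is to look for the first strongly directional character. This is a sketch of the general technique, not T-Pen’s actual implementation; the function name and the exact Unicode ranges chosen are my own.

```javascript
// Decide text direction from the first strongly-directional character,
// like HTML's dir="auto". The RTL ranges below cover the Hebrew through
// Arabic blocks plus Arabic presentation forms; illustrative only.
function detectDirection(text) {
  const rtl = /[\u0590-\u08FF\uFB1D-\uFDFD\uFE70-\uFEFC]/;
  const ltr = /[A-Za-z\u00C0-\u024F]/;
  for (const ch of text) {
    if (rtl.test(ch)) return "rtl";
    if (ltr.test(ch)) return "ltr";
  }
  return "ltr"; // default when nothing directional is typed yet
}

// In the browser, the text box would then be flipped as the user types:
//   textBox.setAttribute("dir", detectDirection(textBox.value));
```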

Close Focus/Keyboard
This is the layer closest to transcription, and here we placed those tools and features that would be most used during the act of transcription: ease of viewing the image, for instance. As an example we will use one of the tools mentioned in the last blog on this topic, Peek Zoom (CTRL Shift). This function makes the line being transcribed fit the width of the window. In many cases this means the line is enlarged and presented above the transcription tool. In some cases the line is reduced in size (if the window is narrower than the line, for instance), but this means the whole line is visible, which may help with context for the transcription of an abbreviation. By making this a key command, it becomes something the user can do without breaking their focus on the transcription. Thus we identified this function as needing to be close to the core function and enabled that through its activation via key command. Similarly, ‘special characters’ (the first 9 characters at least) and ‘hide workspace’ have key commands to keep the user’s focus where it should be: on the transcription. We also have option ↑ and option ↓ to help navigate lines quickly so as not to break the transcriber’s flow. The special characters are not a perfect fit for key commands, though, as we will discuss later.

Near Focus/Split Screen
When transcribing in the traditional way, the user would occasionally have to break their focus on transcription to check Cappelli or a dictionary, or pick up a magnifying glass to look at something more easily. The user begins to engage with the manuscript more at the page level and less at the single line. In such cases the user disengages from the act of transcription but still remains engaged with parts or the whole of the presented page. The tools and features that fall in this layer can also reduce that tight focus on the act of transcription, so we used the split screen functionality for this layer. Resources such as dictionaries and Cappelli were already here, and split screen as a mode of interaction allowed us to clean up the interface, too, as a number of existing resources and tools were already split screens. Activating this mode requires a mouse action, but it didn’t matter whether that happened by a button or a pull-down. In moving to a pull-down we were able to clean up the interface, put related resources and tools together in the same place, and reduce their visibility a little for that cleaner workspace. But that didn’t work for some tools that made sense together: each was too small in itself to be a single split screen, and in function they were a little closer to the core act of transcription than the resources in the split screen pull-down. These we put together as Page Tools and set out as a button, rather than as part of the split screen drop-down, to bring the tools it contains a little closer to the user.

Distant focus/Option tab.
This layer is the one closest to the existing version of T-Pen in its form. This is for two reasons: the mode of interaction was very suitable for the features and tools that fall in that layer, but also because we were looking to update the transcription interface, not the whole site, and this represents the point at which the user has stepped away from the transcription completely and is looking to administer the project as a whole rather than perform the act of transcription.

The exception(s) that prove the rule
There are a couple of tools that have not been mentioned in this blog post that don’t quite fit into these layers: Inspect, Characters, and XML tags.
In terms of focus, Inspect and XML tags fit into the Near Focus zone of our onion, while Characters fits better in the Close Focus range. Characters, as we have already talked about, has key commands as a mode of interaction for character insertion, but there are two major demands that insist we do more than key commands. Firstly, in the classic version of T-Pen all the buttons could be viewable, and secondly, any character could be inserted by using the characters as buttons. This was a case where if it is not broken, it is not in need of fixing. The same argument holds for the XML tags. The XML tags are also more distant from the act of transcription: XML adds to and helps to encode the text, but it is not transcription in itself. While many of our users use XML tags, the way in which they use them and the degree to which they use them vary greatly. The XML tags can be used as the insertion of an opening tag with a closing-tag reminder in the bottom left of the text input box, or as text highlighted with opening and closing tags inserted at the same time. In either case the user takes a hand off the keyboard to engage with the transcription in a different way than straight input. This means that bringing key commands to the XML tags would be complex and would reduce the ways in which the XML tags can be inserted; a little gain for some users and a loss for a lot of others seems not to be worth the trouble. The final rule breaker is Inspect. Again the focus is near: its function is to allow the user to look more closely at a detail that the Peek Zoom or Hide Workspace options don’t help with, so the user must again lift their hand, mentally and physically, away from the transcription to metaphorically lift a magnifying glass. Putting this in the split screen doesn’t make sense, as it buries the function amid resources when it is not one, and it stands a little closer to the transcription than the split screen tools do.

All in all, the new T-Pen interface is a mixture of changes, continuations, and, we hope, clarity for our users.

Next time: Page tools and getting more out of your images


TPEN Updating the transcription interface. Part 1.

The Center had the good fortune last year to work on a custom embedded version of T-Pen for the new French Paleography website from the Newberry Library. The University of Toronto did a great job of building out the site while we turned the backend of T-Pen into web services, to allow for more flexible versions of the front-end transcription interface to accommodate Newberry’s needs and to better suit early modern French paleography. This year we are able to bring those changes into T-Pen itself.

This constitutes the first major update to the interface since its launch, and while long past due, we were able to use feedback from the last five years to focus on the needs of our users. So, building on that and the work we did for Newberry, we set to work.

Too Many Options. But you need them.
You can do a lot in the original T-Pen transcription interface, but for many the first time was a confusing one, with so many tools. Newberry wanted some of those tools and some new ones, so while we developed T-Pen(NL) we found ourselves thinking about those tools and their priority in the interface. We thought about what we needed to change, add, or even delete. One thing we have come to understand over the lifetime of T-Pen is obvious: some tools are used a lot, but some not so much. But that is only half of the equation for making changes, deletions, or additions. The second part was why.

In the old interface all the tools are available (with the opportunity to add or edit a few in the project management interface). They are grouped in relatively arbitrary ways, and with character and XML tags the transcription tool bar could expand quite a bit. There are also quite a few key-command-operated tools that many users never knew were there, and we wanted to bring them forward enough to be noticed, given the positive response we got when we demonstrated them. Another issue was that there were several ways the tools were interacted with: as split screens, pop-overs, redirects into management tools, or simple buttons for tool selection. These modes of interaction were not always implemented consistently. So we set out to see if we could arrange and reorder the interface to be cleaner and easier to use, improve access to the hidden tools, privilege the most used tools, and be more consistent in how the tools function, without abandoning any tool. The first thing we did for this approach was to clearly mark out what we needed to privilege above all: transcription.

Transcription. Transcription. Transcription.
Transcription is the primary function of T-Pen. When it ceases to be of value for transcription, it has no value. There are several strengths to T-Pen as a transcription tool, covering a wide range of things, but for the purpose of reworking the T-Pen transcription interface the essential one is facilitating the act of transcribing a text. Everything else that T-Pen allows a user to do must be presented to the user as it relates to that act of transcription.

At its heart there are two things T-Pen does to aid in transcription: its presentation of the content to be transcribed, and aiding the typing of the text. Transcription in T-Pen means typing on a keyboard (mostly), so to support that we needed to do what we could to keep the user’s hands on the keyboard. To help with this, T-Pen has key commands like navigating lines by option ↑ and option ↓. We also reduced the space between the top of the transcription tool and the text box. The first 9 special characters can be typed via CTRL with 1 through 9 to insert the character into the text. It was suggested to us that we should have a preset collection based on common usage, but given the unexpectedly wide variety of characters used, as well as languages and character sets, this proved unviable. The introduction of basic right-to-left character support complicates that even more, but it does make an argument for dedicated or custom interfaces for different formats and languages. This kind of custom interface is something you will be hearing more of from us over the coming months, but it is something we have been thinking about and planning for many years, and the conversion of our back end into web services is part of that long-term objective. More on that in the coming months, but for now this blog must return to its purpose. So, back to transcription and the presentation of the content to be transcribed.

One of the most influential elements on the quality of a transcription is the distance between the text to be transcribed and the transcription itself, and that is reflected in our placement of the Transcription Bar and its text box immediately below the line being transcribed. But we don’t stop there. We wanted to help the user take advantage of the fact that the image is a digital image, so we made Peek Zoom (CTRL Shift). This makes the line being transcribed enlarge to fill the width of the window, so if your window is wide the line can get substantially bigger. While we have the Inspect button (which allows you to magnify parts of the image) on the Transcription Tool Bar, this quick key command doesn’t break the flow of transcription by taking the user’s hands off the keyboard. Similarly, the Hide Workspace (ALT CMND) key command hides the Transcription Tool Bar for as long as the user holds ALT CMND, which allows the user to look at other details of the image, such as the next line, to help with context to decode abbreviations or hard-to-read words on the line being transcribed.
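Key commands like these come down to a small keydown handler. The sketch below is hypothetical: the event properties are standard DOM, but the handler and the `peekZoom`/`insertCharacter` hooks are invented for illustration, not taken from T-Pen’s source.

```javascript
// Dispatch the key commands described above. `actions` supplies the
// application hooks; both hook names here are illustrative.
function handleKey(e, actions) {
  // CTRL Shift: peek zoom the current line to the window width.
  if (e.ctrlKey && e.key === "Shift") return actions.peekZoom();
  // CTRL 1 through 9: insert one of the first nine special characters.
  if (e.ctrlKey && e.key >= "1" && e.key <= "9") {
    return actions.insertCharacter(Number(e.key) - 1);
  }
  return null; // anything else falls through to normal typing
}

// In the browser:
//   document.addEventListener("keydown", e => handleKey(e, { peekZoom, insertCharacter }));
```

Returning `null` for unmatched keys is what keeps ordinary typing untouched, so the hands-on-keyboard flow is preserved.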

But there are times when the user needs to take their hands from the keys and reach for the mouse, and the next blog entry will go into details about how we figured out what should go where and why.


T-PEN Development Advance Post


The Center for Digital Humanities is excited to announce the resumption of work on the T-PEN project (Transcription for Paleographical and Editorial Notation). Since T-PEN launched in 2012 with generous funding from the Andrew W. Mellon Foundation and the NEH, there have been 1500 unique users working on 2000 projects. New feature development, however, has been unfunded and proceeded at a crawl. Thanks to an investment from the Saint Louis University Libraries and coordination with several smaller funding sources, we are now in a position to both develop a significant improvement to the existing application and begin work on the next version (3.0).
