With all our Citizen Historians gallantly tagging away, we thought it was high time we explained how all that hard work is being used to produce the data sets for the project.
While we really appreciate all the effort each and every individual is putting in on the diaries, we know that errors can arise for one reason or another. For that reason, we generate what’s known as consensus data, and we have an algorithm that does this for us.
To begin with, each of the diary pages is tagged by at least five Citizen Historians. That’s five different people who might each look at that page in a slightly different way. Once that tagging is complete, the diary page is closed and put in the queue for processing.
The system starts by identifying tags of the same type relating to the same entity (a place, a person, an action etc.). It has to take a best guess at this, clustering tags together when they sit within a certain distance of each other, measured as a percentage of the image size for each scanned diary page. Trial and error has shown that these percentages are best set at 3% vertical and 10% horizontal. There must be a minimum of two tags for a particular entity if it is to make it into the final consensus data set. So, if two of the five Citizen Historians who have tagged a diary page have both identified a place in the same position on that page, that place makes it in.
Image © IWM (Q 5700)
The consensus tag generated from this tag cluster is then placed at the average location of all of its constituent tags.
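For the technically curious, here is a minimal Python sketch of how clustering and averaging of this kind might look. It is not the project's actual code: the `Tag` fields, the function names, the greedy grouping strategy and the convention of storing positions as fractions of the page size are all assumptions made for illustration. Only the 3% and 10% thresholds and the two-tag minimum come from the description above.

```python
from dataclasses import dataclass
from statistics import mean

# Thresholds from the post, expressed as fractions of the page image size.
MAX_DY = 0.03   # 3% vertical
MAX_DX = 0.10   # 10% horizontal
MIN_TAGS = 2    # minimum tags for an entity to reach the consensus set


@dataclass
class Tag:
    kind: str   # e.g. "place", "person", "activity" (hypothetical labels)
    x: float    # horizontal position, 0.0-1.0 of page width (assumed convention)
    y: float    # vertical position, 0.0-1.0 of page height (assumed convention)
    user: str   # who placed the tag


def cluster_tags(tags):
    """Greedy grouping: a tag joins the first cluster of the same kind whose
    current centre lies within both thresholds, otherwise it starts a new one."""
    clusters = []
    for tag in sorted(tags, key=lambda t: (t.kind, t.y, t.x)):
        for cluster in clusters:
            cx = mean(t.x for t in cluster)
            cy = mean(t.y for t in cluster)
            if (cluster[0].kind == tag.kind
                    and abs(tag.x - cx) <= MAX_DX
                    and abs(tag.y - cy) <= MAX_DY):
                cluster.append(tag)
                break
        else:
            clusters.append([tag])
    return clusters


def consensus_tags(tags):
    """Keep clusters with at least MIN_TAGS tags and place each consensus
    tag at the average position of the tags in its cluster."""
    result = []
    for cluster in cluster_tags(tags):
        if len(cluster) >= MIN_TAGS:
            result.append({
                "kind": cluster[0].kind,
                "x": mean(t.x for t in cluster),
                "y": mean(t.y for t in cluster),
                "votes": len(cluster),
            })
    return result
```

In this sketch a lone tag with no near neighbour of the same type simply never reaches the output, which mirrors the two-tag rule described above.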
Next the system has to determine exactly what information should be attached to each tag. This is relatively straightforward when the original tags came from a fixed list (e.g. Activities tags, which can be of only a certain number of types). Where tags contain free text (e.g. person or place), fuzzy text matching is used to determine their attached information (e.g. Slater-Booth, Sclater-Booth and similar variants would be grouped together). Where a majority of these free text tags have the same value, that value becomes the consensus value. However, if there is no clear majority value, then the consensus tag will be formed of the leading variants.
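As a rough illustration of that free-text step, the sketch below groups similar spellings and then picks a consensus value. It uses `difflib.SequenceMatcher` from Python's standard library as a stand-in for whatever fuzzy matching the real system uses. The 0.8 similarity cut-off, the function names and the choice to keep the two most common variants when there is no majority are assumptions, not details taken from the project.

```python
from collections import Counter
from difflib import SequenceMatcher

SIMILARITY = 0.8  # assumed cut-off; the post does not state one


def similar(a, b):
    """True when two strings are close enough to count as the same name."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= SIMILARITY


def group_variants(values):
    """Put each value into the first group whose first member it resembles."""
    groups = []
    for value in values:
        for group in groups:
            if similar(value, group[0]):
                group.append(value)
                break
        else:
            groups.append([value])
    return groups


def consensus_value(values, max_variants=2):
    """Return the majority spelling, or the leading variants if none exists."""
    counts = Counter(values).most_common()
    top_value, top_count = counts[0]
    if top_count > len(values) / 2:
        return [top_value]
    # No clear majority: keep the most common variants (assumed limit of two).
    return [value for value, _ in counts[:max_variants]]
```

With input such as `["Slater-Booth", "Sclater-Booth", "Sclater-Booth", "Smith"]`, the first three values fall into one group and "Smith" into another, and the majority spelling within the first group becomes its consensus value.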
The algorithm is also designed to create serialised data. In essence, this means that each consensus tag is associated with a date, which allows the data generated to then be ordered by date. When Citizen Historians tag dates on a diary page, they essentially segment that page, and it’s that segmentation which allows the system to determine which consensus tags should lie inside which date area.
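One way to picture the date segmentation is sketched below. It assumes each date tag has a vertical position and governs everything beneath it until the next date tag. The function name, the data shapes and the choice to leave tags above the first date undated are illustrative assumptions rather than the project's actual rules.

```python
from bisect import bisect_right


def assign_dates(date_tags, consensus_tags):
    """Attach a date to each consensus tag.

    date_tags: list of (y, iso_date) pairs, y as a fraction of page height.
    consensus_tags: list of dicts that each contain a "y" key.
    A tag takes the date of the nearest date tag at or above it on the page.
    """
    ordered = sorted(date_tags)
    ys = [y for y, _ in ordered]
    dated = []
    for tag in consensus_tags:
        i = bisect_right(ys, tag["y"])
        # Tags above the first date on the page are left undated (assumption).
        date = ordered[i - 1][1] if i > 0 else None
        dated.append({**tag, "date": date})
    return dated
```

Because the dates in this sketch are ISO-formatted strings (for example "1914-09-14"), the dated tags can afterwards be put in chronological order with an ordinary sort on the date field.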
Once these operations have been carried out for one page of a diary, the next page will be processed and so on until the diary is complete.
Don’t worry about us losing all the tags you’ve generated, though – our databases hold everything that every single one of our Citizen Historians has added to Operation War Diary, be it individual tags, hashtags or text comments. We know just how valuable a resource that’s going to be for anybody wanting to investigate the diaries beyond the standard, structured tags we’ve defined.
Why not check out our first batch of consensus data here: http://wd3.herokuapp.com/public
In the first eight weeks since the launch of Operation War Diary, over 10,000 citizen historians worldwide have tagged names, places and other details in over 200 unit war diaries.
Initial reports reveal some amazing statistics:
- Over 260,000 tags relating to named individuals
- Over 332,000 tags relating to places
- Almost 300,000 tags relating to activities
- The amount of volunteer effort put in so far is equivalent to one person working 40 hours a week for four years.
With your help we’re going one step further than traditional transcription by using the data to digitally map and analyse patterns and trends in the unit war diaries, offering new perspectives on the First World War. Our developers and academic advisory group are hard at work crunching the numbers from the first two months of the project – we’ll blog about their plans soon.
In addition, much of the data – particularly names and places – will be integrated into The National Archives’ catalogue (Discovery), allowing researchers and family historians to search the diaries for named individuals. Making the data freely available to researchers in this way is hugely important to all of the project partners, and we want everyone to be able to benefit from the amazing efforts put in by citizen historians. The data will be published under the Open Government Licence.
The data will also be available to users of Lives of the First World War, which we’ll tell you more about soon.
Find out more about the unit war diaries
Today The National Archives has published another 3,987 digitised First World War unit war diaries from France and Flanders online, which means that around 6,000 diaries are now available on The National Archives’ website to search free of charge (and download for a small fee). These will in time be added to Operation War Diary for tagging.
If you’d never encountered a unit war diary before tagging them for Operation War Diary, you may have been wondering what all the fuss is about and why they’re so important to researchers and historians. They contain a wealth of information of far greater interest than the army could ever have predicted, providing insight into daily events on the front line, and are full of fascinating detail about the decisions that were made and the activities that resulted from them.
If tagging the unit war diaries has inspired you to find out more, whether about a particular unit or an individual, The National Archives has many useful online resources that should help you. Our First World War 100 portal gives an overview of the millions of records that we hold relating to the war, from war diaries to conscription appeals via service records and Cabinet papers. It’s an essential starting point for anyone researching a First World War ancestor, with step-by-step guidance to help you on your way. We also have a wealth of multimedia resources, including podcasts and the popular My Tommy’s War blog series. We send out a free monthly enewsletter with news and updates relating to our work and collection – sign up today to receive your own copy!
The National Archives