Last update 29-10-2014
 
How the Transactions Index is being constructed.

Current Status

The index is now essentially complete and covers all volumes up to 2014.

There is a little tidying up still to do on the entries that have punctuation at the start of the line, which causes them to appear out of alphabetical order in the listing.
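One possible way to deal with the ordering problem without editing the entries themselves is to sort on a key that ignores leading punctuation. The following is only a minimal sketch of that idea, not the routine used on the website; the function name sort_key and the sample entries are hypothetical.

```python
import string

# Characters to ignore at the start of an entry when sorting.
LEADING_JUNK = string.punctuation + string.whitespace


def sort_key(entry: str) -> str:
    """Return a sort key that skips leading punctuation and ignores case."""
    return entry.lstrip(LEADING_JUNK).casefold()


# Hypothetical entries, for illustration only.
entries = [
    '"Abbey" lands',
    "Carlisle, castle",
    "(Bewcastle) cross",
    "Appleby, charters",
]

# The stored text is unchanged; only the order of the listing is affected.
ordered = sorted(entries, key=sort_key)
```

With a plain sort the entries starting with a quotation mark or bracket would be listed before everything else; with the key above they fall under A and B respectively.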

The index is now in one database. Because of this, a limit has been placed on the search so that a maximum of 200 matches is returned per search. If there are more than 200 matches, a button will appear at the bottom of the listing that lets you scroll through all of them, though it is probably better to refine your search terms to reduce the number of matches. The layout is also now slightly different and hopefully better than before.
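The 200 match limit and the "more" button can be thought of as simple paging. The sketch below illustrates the idea under stated assumptions: the entries are held in a list, matching is a case-insensitive substring test, and the names search_page and MAX_MATCHES are hypothetical. It is not the code running on the website.

```python
MAX_MATCHES = 200  # the per-search limit described above


def search_page(entries, term, page=0, page_size=MAX_MATCHES):
    """Return (matches on this page, whether further matches remain).

    Assumes a case-insensitive substring match, which may differ from
    the matching rule actually used by the website.
    """
    wanted = term.casefold()
    matches = [e for e in entries if wanted in e.casefold()]
    start = page * page_size
    chunk = matches[start:start + page_size]
    has_more = len(matches) > start + page_size
    return chunk, has_more


# Hypothetical data: 450 entries that all match the search term.
entries = [f"Carlisle, item {i}" for i in range(450)]
first_page, more = search_page(entries, "carlisle", page=0)
last_page, more_after_last = search_page(entries, "carlisle", page=2)
```

The has_more flag corresponds to the button at the bottom of the listing: it is only needed when the number of matches runs past the current page.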

Only the 1889 volume is currently on line in full, as a trial, and links to it are provided from any match that is in that volume. We are currently completing the scans of all volumes, which will be put on line as soon as possible. There is no definite time scale for this, but it is not likely to be before mid 2015.

Next steps. Authors have been extracted from the index and the next stage will be to allow a search by author. An attempt will also be made to improve the search algorithm so that different spellings of words can be found. However, both of these will have to wait until all the copies of the Transactions are on line, as that is the first priority.

The rest of this page has been left in place as it will give you an idea of how this index has been built up over the years.

John Steel, Webmaster, October 2014.

The Start

The index project started several years ago when it was decided to put some fairly recent contents pages on the website. This aroused the interest of several people and efforts started to try and improve the amount of information available on line.

John Steel, the Webmaster working on the project, was then contacted by Martin Pugmire, who volunteered the work he had done to extract the Transactions contents pages from 1975 to 1997. This data was then transcribed by John into web format and placed on the website, though it was not searchable.

Around the same time it was learned that the indexes supplied by Bill Wiseman and edited by James Cherry were available in electronic format for 1990 to 2004. At this point they were just put on line in page format and were not easily searchable.

1990 to 2004

With the growing interest in having a searchable database of the Transactions Index on line, John Steel undertook to start the exercise by putting the 1990 to 2004 data into an on line database. This was quite a challenge, as this Index relied heavily on a method of presentation that was difficult to read electronically. The years were in bold text and the remaining information in ordinary text. This, together with the use of italics for certain items, made it difficult to transcribe, as optical character recognition does not in general discriminate between fonts or typefaces. Nevertheless a transcription was made and then manually edited into a format in which it could go on line. It has not yet been possible to work out programming to enable this period to be searched by year, as there are multiple year entries on each line for some items. This will be done when all the other years are available on line, as it is felt that it is more useful to get more years on line than to refine this content. [Update: after considerable work a program was developed that could do a lot of the work, but the final checking and tidying up had to be done manually.]

1960 to 1989

It was believed that the 1960 to 1989 Index was available in electronic format. After all, a word processor had been purchased by the Society for exactly this purpose. However the search for it proved fruitless. In the end it was decided to scan and optically read the index from the printed volume INDEX TO NEW SERIES TRANSACTIONS (1960-1989) by J & J Cherry. The actual scanning was quite successful, though a lot of manual work was required to get it into a suitable format. The biggest issue was the same as for 1990-2004, in that a single line could contain entries for several years, the year being denoted by a two digit bold entry. Unfortunately the optical scanning available cannot distinguish between bold, ordinary and italic text, though it does make a pretty good job of recognizing the actual characters.

This period has over 41,000 entries and it was necessary to manually go through all the files to extract each multiple line entry and turn it into a single line for each year so that the index can be searched by year. As can be imagined, this took considerable time, but eventually the index was placed on line.
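To illustrate what turning a multiple year entry into a single line for each year involves, here is a minimal sketch. It rests on an assumption that is not stated in the text: that the two digit years, which were bold in print and lost in scanning, have been marked up by hand in square brackets, for example "Carlisle, castle [61] 12, 14; [75] 203". The markup, the function name split_by_year and the sample entry are all hypothetical, and the work described above was largely done by hand.

```python
import re

# Assumed markup: a two digit year in square brackets, e.g. "[61]".
YEAR_MARK = re.compile(r"\[(\d{2})\]")


def split_by_year(line, century=1900):
    """Split one multi-year entry into (subject, year, pages) rows."""
    # re.split with one capturing group returns:
    # [subject, year, pages, year, pages, ...]
    parts = YEAR_MARK.split(line)
    subject = parts[0].strip().rstrip(",")
    rows = []
    for yy, pages in zip(parts[1::2], parts[2::2]):
        rows.append((subject, century + int(yy), pages.strip(" ,;")))
    return rows


# Hypothetical entry, for illustration only.
rows = split_by_year("Carlisle, castle [61] 12, 14; [75] 203")
```

Each resulting row carries the full subject and a single year, which is what allows the index to be searched by year.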

Other Information

By this time Anne Hillman had managed to find copies of some other information that had been extracted. This enabled the construction of the partial index from 1901 to 1972. This was essentially a list of contents and authors rather than a full index for the period, but it was scanned and put on line.

1866 to 1900

By now there was great interest in getting all the Transactions indexed and placed on line as a research tool. It was obvious there was a great deal of work involved, and Mark Brennand was tasked with looking into the possibility of professional assistance for the scanning of the remaining volumes. This proved to be rather difficult, and possibly expensive, the main issue being to get a professional company to scan the available printed index into an electronic format that could be used to put it into a database. At the time, some 4 or 5 years ago, the facilities for doing this on a personal computer were not good, but new software was becoming available and optical scanning accuracy had improved considerably. On that basis John Steel offered to continue with the exercise.

Mark produced a copy of the 1866 - 1900 printed Index, courtesy of Ian Caruana. It was somewhat ragged and water damaged, but given that it was destined to be cut up for scanning, that was not a major disadvantage.

This volume had the same drawbacks as the ones worked on already, in that there were multiple year entries on a single line. So the same process was used with much of the work being done manually. This was finally put on line about two years ago.

1901 to 1959

Although many years still needed to be refined, we now basically had a searchable index covering 1866 to 1900 and 1960 to 2007. The only way to fill in that gap was to use the index supplied with each volume. These were photocopied by Mark Brennand and sent to John Steel for scanning and editing. At the time of writing this article (June 2011) 1940 to 1959 has just been put on line, with work continuing on the remaining years. These copies brought new problems to the process. The layout of the index was not consistent, with greater use of abbreviations and concatenating entries onto lines to save space. So although it has made the issue of recording the year and volume easier, the work involved in correcting all the entries far outweighs these savings. Typically each year is taking four or five hours from scanning to database.

How is it done?

The first step is to get the photocopy of the required index. This is then scanned into the Optical Character Recognition program. This scanning step will try and determine the text to scan, which is made a little more difficult by the fact that the index is generally in two columns. It is necessary to check each page to see which areas will be scanned and to adjust the scan area where required.

The scan is now initiated and this is brought into the computer memory. Over a period of time the scanning program builds up a database of words, plus it uses a dictionary for regular words. If it finds any word it does not recognize then that word is presented for editing. More often than not the items that it fails to recognize are numbers, or words it has never encountered before. After this first pass a copy is available for further editing.

It is now necessary to go through the file line by line looking for possible errors and correcting for the layout. For example, if there are 5 entries for Carlisle, they will probably be presented on one line or in one section with Carlisle only mentioned at the beginning. When reading this it is obvious what is intended. However, a database needs all the information in every entry or it cannot be searched, so it is necessary to manually enter all these missing words. In the volumes that have already been done it was possible to write computer routines to assist with this. With 1901-1959, however, the style and layout change so much from year to year that a special routine would have to be developed for each year. It is more practical to do it manually, but it does take time.
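For the simplest case, where the sub-entries sit on indented lines beneath the first entry, filling in the missing heading word can be sketched as below. This is an illustration under an assumed layout (heading word before the first comma, sub-entries indented), not one of the routines actually written for the earlier volumes; the function name expand_entries and the sample lines are hypothetical.

```python
def expand_entries(lines):
    """Repeat the heading word on indented sub-entries.

    Assumed layout: a line starting in the first column begins with the
    heading word, followed by a comma; indented lines belong to the most
    recent heading.
    """
    heading = ""
    expanded = []
    for raw in lines:
        text = raw.strip()
        if not text:
            continue  # skip blank lines
        if raw[0].isspace() and heading:
            # Sub-entry: put the current heading in front of it.
            expanded.append(f"{heading}, {text}")
        else:
            # New heading line: remember the word before the first comma.
            heading = text.split(",", 1)[0].strip()
            expanded.append(text)
    return expanded


# Hypothetical index fragment, for illustration only.
scanned = [
    "Carlisle, castle, 12",
    "  cathedral, 45",
    "  walls, 78",
    "Kendal, church, 101",
]
full_entries = expand_entries(scanned)
```

After expansion every line names Carlisle or Kendal explicitly, so each can stand alone as a database entry. The real indexes vary far more than this, which is why the 1901-1959 work was done by hand.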

Once this file is checked and edited it is proof read again, to try and pick up errors. It is then saved to a text file which is suitable for importing to the database. The import routine is automatic: it adds the year and volume details to each line as it appends it to the database.
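The import step might look something like the following sketch. The database system used by the website is not stated, so SQLite is assumed here purely for illustration, and the table name, column names, function name import_year and sample values are all hypothetical.

```python
import sqlite3


def import_year(conn, lines, year, volume):
    """Append each non-blank line to the database, tagged with year and volume."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries (entry TEXT, year INTEGER, volume TEXT)"
    )
    rows = [(line.strip(), year, volume) for line in lines if line.strip()]
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical usage with an in-memory database and made-up entries.
# In practice `lines` could be an open text file, read line by line.
conn = sqlite3.connect(":memory:")
added = import_year(
    conn,
    ["Carlisle, castle, 12\n", "\n", "Kendal, church, 101\n"],
    year=1950,
    volume="50",
)
stored = conn.execute("SELECT entry, year, volume FROM entries").fetchall()
```

Because the year and volume are supplied once per file rather than typed on every line, the edited text file only needs to contain the entries themselves.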

What is next

Work will continue on 1901 to 1939 to complete all years. Typically this will take 5 hours per year, so with about 40 years to do there is still about 200 hours of work to complete. Once there is a full index available, work will start on separating those periods which are not yet separated by year, until we have a full database that can be searched by year as well as item.

At the moment the search routine is simple, so it will only find the exact spelling of words. Once the database is complete, work will be carried out to improve the search algorithm so that it will find misspelled words and words that have multiple spellings.
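As an indication of what a spelling-tolerant search could look like, the sketch below uses the similarity matching in Python's standard difflib module. This is only one possible approach and is not the algorithm planned for the website; the function name fuzzy_search, the cutoff value and the sample entries are assumptions made for illustration.

```python
import difflib


def fuzzy_search(entries, term, cutoff=0.8):
    """Return entries containing a word that is close in spelling to `term`.

    `cutoff` is a similarity score between 0 and 1; higher values demand
    a closer match. The value 0.8 is an arbitrary starting point.
    """
    wanted = term.casefold()
    hits = []
    for entry in entries:
        # Compare against each word of the entry, ignoring case and
        # surrounding punctuation.
        words = [w.strip(".,;:()").casefold() for w in entry.split()]
        if difflib.get_close_matches(wanted, words, n=1, cutoff=cutoff):
            hits.append(entry)
    return hits


# Hypothetical entries: the second uses a variant spelling.
entries = ["Carlisle, castle", "Carlile, bishop", "Kendal, church"]
found = fuzzy_search(entries, "Carlisle")
```

An exact search for Carlisle would miss the variant spelling in the second entry, whereas the similarity test picks it up. The trade-off is that a lower cutoff finds more variants but also returns more unrelated words.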

Finally different search options and page layouts will be worked out to facilitate searching for items and printing the results.

There have been around 9000 searches of the database since it was put on line about 2 years ago. Suggestions as to how to improve it are always welcome, as are corrections to the database if errors are found.

 

John Steel, Webmaster, 6th June 2011.