Major changes to search functionality

The latest release of The Digital Archives contains a number of important changes to search functionality.

Published at: 2017-11-08

The most important change is allowing users to enable/disable name variant search. We have also made changes to the algorithms which determine search result relevancy. In addition to this, we have improved the information users see about results, included tips on what to do if no results are generated and improved gender-based filtering in the advanced person search.

Name-variant search

Name-variant search has proven problematic when combined with wildcard searches. For new users with little experience of The Digital Archives, name-variant search can be very helpful; for more experienced users, however, it can be a hindrance. We will be expanding the register of name variants to improve this form of searching, but for now we have made it possible to deactivate name-variant search manually. In addition, we now display a warning that combining name-variant search with wildcards can produce fewer results than expected. We have chosen not to disable name-variant search by default, which will allow wildcard searching of name variants, place of residence and time periods.

Improved results

When we released DA2017 in June, we adopted the "other" method for global searches, i.e. we changed from an AND-search to an OR-search. If one searches for "Ole Theodor Larsen Oddernes", the search actually looks for "Ole or Theodor or Larsen or Oddernes". This is why the number of results grows as you add more search terms.
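The effect of the switch can be illustrated with a small sketch. The records, the whitespace tokenisation and the function names below are assumptions made for illustration only; they are not taken from The Digital Archives' search engine.

```python
# Hypothetical indexed entries, used only to illustrate the principle.
RECORDS = [
    "Ole Theodor Larsen Oddernes",
    "Ole Hansen",
    "Theodor Kittelsen",
    "Anna Larsen",
    "Kari Nordmann",
]


def tokens(text):
    """Split a text into a set of lowercase words (a simplifying assumption)."""
    return set(text.lower().split())


def or_search(query, records):
    """Return records containing at least one of the search terms."""
    terms = tokens(query)
    return [r for r in records if terms & tokens(r)]


def and_search(query, records):
    """Return records containing every search term."""
    terms = tokens(query)
    return [r for r in records if terms <= tokens(r)]


# Adding terms widens an OR-search but narrows an AND-search.
for query in ["Ole", "Ole Theodor", "Ole Theodor Larsen"]:
    print(query, len(or_search(query, RECORDS)), len(and_search(query, RECORDS)))
```

With these sample records, each added term brings in more entries for the OR-search, while the AND-search can only stay the same or shrink.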

New formula for calculating relevancy

Using name-variant search means that we cannot use an AND-search in the live search box on the main page. Regardless, an OR-search with a correctly sorted results list should show the results an AND-search would return at the top of the list. At least that was the intention. During development, we uncovered the reason why the results lists did not show what one would expect. This issue should now be resolved with the new release.

Figure 1 shows a search for Roald Amundsen in the new release of The Digital Archives. The standard calculation of relevance in the search engine is greatly affected by the number of times a word is repeated. This means that the result "Knut H. Roald", who was born in Roald, lived in Roald at the Roald Farm located in Roald county, would be highly relevant. The fact that Roald Amundsen hits on multiple words becomes irrelevant because Knut H. Roald trumps Amundsen on the number of "Roalds". We have now changed the relevancy formula so that the number of unique occurrences of a word, rather than how often it is repeated, determines the result relevancy. If you perform an "AND-search", this will only generate results for persons who have both Roald and Amundsen in their name. With the new formula for calculating relevancy, we can ensure that the top hits in an "OR-search" are the same as those an "AND-search" would have generated.
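The difference between the two ways of scoring can be sketched as follows. This is a simplified illustration, not the search engine's actual formula: `frequency_score` is a crude stand-in for a repetition-sensitive score, `unique_score` counts distinct search words matched, and the entries are invented for the example.

```python
def words(text):
    """Lowercase words with simple punctuation stripped (an assumption)."""
    return [w.strip(".,").lower() for w in text.split()]


def frequency_score(query, entry):
    """Stand-in for the old behaviour: every repetition of a term counts."""
    terms = set(words(query))
    return sum(1 for w in words(entry) if w in terms)


def unique_score(query, entry):
    """Sketch of the new idea: count distinct query terms present."""
    return len(set(words(query)) & set(words(entry)))


def rank(query, entries, score):
    """Sort entries by the given score, highest first."""
    return sorted(entries, key=lambda e: score(query, e), reverse=True)


# Hypothetical entries modelled on the example in the text.
ENTRIES = [
    "Knut H. Roald born Roald residence Roald farm Roald",
    "Roald Amundsen born Borge",
]

print(rank("Roald Amundsen", ENTRIES, frequency_score)[0])
print(rank("Roald Amundsen", ENTRIES, unique_score)[0])
```

Under the repetition-based score the entry with four "Roalds" comes first; under the unique-term score the entry matching both search words comes first, which is the entry an AND-search would have returned.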

Improved presentation of number of hits

When you perform a free-text search, it is now run as both an OR-search and an AND-search in the search engine. This means that we can give a more accurate report on the number of hits, while still showing additional hits that may be relevant. The result list is therefore based on an OR-search, but sorted on relevance calculated using the new algorithm. An example of results lists can be seen in figure 2.

In the results list for the front page search box, we have marked the exact hits with a white background, and the less relevant with grey. This can be seen in figure 3.
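A minimal sketch of how such a combined report could be produced is shown below. The function name, the result structure and the sample records are hypothetical; the sketch only assumes that an exact hit is one containing every search term, and that the list is sorted by the number of distinct terms matched.

```python
def tokens(text):
    """Split a text into a set of lowercase words (a simplifying assumption)."""
    return set(text.lower().split())


def search_with_counts(query, records):
    """Run an OR-search, flag the hits an AND-search would also return,
    and report both counts alongside the sorted list."""
    terms = tokens(query)
    scored = []
    for record in records:
        matched = len(terms & tokens(record))
        if matched:  # OR-search: at least one term must match
            scored.append((matched, record))
    # Most distinct terms matched first; the stable sort keeps ties in input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    hits = [{"record": r, "exact": m == len(terms)} for m, r in scored]
    exact = sum(1 for h in hits if h["exact"])
    return {"exact": exact, "other": len(hits) - exact, "hits": hits}


# Hypothetical records for illustration.
RECORDS = ["Knut Roald", "Roald Amundsen", "Leon Amundsen", "Kari Nordmann"]

result = search_with_counts("Roald Amundsen", RECORDS)
print(result["exact"], "exact hits,", result["other"], "other hits")
for hit in result["hits"]:
    # "exact" corresponds to the white rows, the rest to the grey rows.
    print("white" if hit["exact"] else "grey ", hit["record"])
```

The `exact` flag is what a front end could use to choose between the white and grey backgrounds.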

Improved gender-based searching

In the advanced person-search, we have now improved the possibilities for searching based on gender. You can choose between "man", "woman", or "unknown". In The Digital Archives material, gender is not always legible, or is obviously incorrect based on other information in the entry. For this reason, we use a number of special symbols to represent incorrect, partially legible and illegible information. Previously, gender-based searches would only return results where the gender was definitely a man or a woman, represented with an "m" or a "k" symbol. "%m%" and "m!!" were not searchable. These symbols actually represent the opposite: the source states "m", but based on other information such as name and occupation/role, this was obviously an error. In such circumstances, we now enter a "k" in the search index. Similarly, "k!!" and "%k%" (not a woman) are entered as an "m" in the search index.

This is a complete list of symbols used for gender representation during transcription, and how we have mapped them to a gender. We have spot-tested the new mappings. Results show that approximately 1 million more entries are searchable with improved gender filtering.

Man:

  • 'm' - man
  • 'g' - boy
  • 'han' - male
  • 'hf' - man of house
  • 'mk' - male gender
  • 's' - son
  • 'k!!' - not woman
  • '%k%' - not woman

Woman:

  • 'k' - woman
  • 'j' - girl
  • 'hun' - female
  • 'hu' - housewife, lady of house
  • 'kv' - woman
  • 'km' - possibly "womankind". Investigations have revealed that these are used for entries with female names.
  • 'd' - daughter
  • 'm!!' - not a man
  • '%m%' - not a man

Any entries not matching these symbols are categorised as "unknown".
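The mapping above can be expressed as a simple lookup. This is a sketch only: the function name is hypothetical, and trimming surrounding whitespace before the lookup is an assumption rather than documented behaviour.

```python
# Symbols from the transcription lists above, grouped by the gender
# they are mapped to in the search index.
MALE_SYMBOLS = {"m", "g", "han", "hf", "mk", "s", "k!!", "%k%"}
FEMALE_SYMBOLS = {"k", "j", "hun", "hu", "kv", "km", "d", "m!!", "%m%"}


def index_gender(symbol):
    """Map a transcribed gender symbol to "m", "k" or "unknown"."""
    value = symbol.strip()  # assumption: ignore surrounding whitespace
    if value in MALE_SYMBOLS:
        return "m"
    if value in FEMALE_SYMBOLS:
        return "k"
    return "unknown"


for symbol in ["m", "m!!", "%k%", "hu", "?"]:
    print(repr(symbol), "->", index_gender(symbol))
```

Note that "m!!" and "%m%" land in the female set and "k!!" and "%k%" in the male set, reflecting that these symbols mark a source value judged to be wrong.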

Fixes

This is a complete overview of all fixes and improvements in this release.

Search and result lists

  • Now possible to combine one region with all councils in a region
  • Geographic selections in the search form are generated in a more efficient manner
  • Improved filtering on gender
  • Gender was not previously included when using copy+paste from results lists
  • Added user tips for searches which generate no results
  • Added user tips for advanced searches which generate no results
  • Added user tips for generic searches which generate no results
  • Added user tips for basic person searches which generate no results.
  • Added user tips for census searches which generate no results.
  • Added user tips for property searches which generate no results.
  • Locate Source: Hashtag search changed from OR-search to AND-search
  • Locate Source: Now possible to search by archive reference number
  • Added new text to display number of results based on combination of AND and OR-search
  • Enabled toggling of name variant search mode
  • Show warnings when searching with both wildcards and name variant search

Other

  • Corrected an error which caused event date to be displayed incorrectly in search results in pages where the date field also contained a year number
  • Added English translations for a number of geographical entities
  • News items can now be published ahead of time and released at a predetermined date
  • Publication date is now displayed in news items