Every /lookup or /bulk-lookup search result (see API) includes a search score. This score is calculated by Apache Solr
and has no upper bound. For every term in the query and every document in the result, Solr calculates a
TF*IDF score by multiplying:
- The term frequency: the relative frequency of the term in the document. Solr uses the equation `freq / (freq + k1 * (1 - b + b * dl / avgdl))`, where `freq` = number of occurrences of the term within this document, `k1` = term saturation parameter, `b` = length normalization parameter, `dl` = length of the field and `avgdl` = average length of the field.
- The inverse document frequency: a measure of how rare this term is among all documents. Solr uses the equation `log(1 + (N - n + 0.5) / (n + 0.5))`, where `N` = total number of documents with this field, and `n` = number of documents containing the term.

If multiple terms are matched in the same document, the sum of the score for each term will be used.
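
The two equations above, and the per-term sum, can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Solr's implementation: the function names are made up for this example, `log` is assumed to be the natural logarithm, and the defaults `k1 = 1.2` and `b = 0.75` are the usual BM25 defaults, which may differ from what a given deployment is configured with.

```python
import math


def term_frequency(freq, dl, avgdl, k1=1.2, b=0.75):
    """Saturating term-frequency factor from the equation above.

    k1 and b default to the common BM25 values (an assumption here).
    """
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def inverse_document_frequency(N, n):
    """Rarity factor: N documents have the field, n of them contain the term.

    Natural logarithm is assumed.
    """
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def document_score(term_stats, dl, avgdl, N, k1=1.2, b=0.75):
    """Sum of TF*IDF over every matched term.

    term_stats is a list of (freq, n) pairs, one per matched query term.
    """
    return sum(
        term_frequency(freq, dl, avgdl, k1, b) * inverse_document_frequency(N, n)
        for freq, n in term_stats
    )


# Hypothetical numbers: two query terms matched in a field of average length.
score = document_score([(1, 10), (2, 500)], dl=10, avgdl=10, N=1000)
print(round(score, 3))
```

Because the term-frequency factor saturates below 1, repeating a term adds less and less to the score, while rare terms (small `n`) contribute far more than common ones.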
The TF*IDF score will be multiplied by several boosts that depend on four factors:
- We index two fields: the "preferred name" of every clique and the "synonyms" of every clique. The preferred name is chosen by Babel, while the synonyms are collected from all the different Babel sources.
- We set up two indexes: one built with a StandardTokenizer, which splits the field into tokens at whitespace and punctuation characters, and one built with a KeywordTokenizer, which treats the entire field as a single token.
- We use the Query Fields (qf) parameter to search for the tokens in the index, and the Phrase Fields (pf) parameter to additionally boost search results where all the tokens are found in close proximity. (NOTE: this might be removed soon.)
- We use the number of identifiers in the clique as a measure of how widely used a clique is. Since some cliques share the same preferred name or label, we can use this to promote the clique most likely to be useful.

For standard query matches, we combine these factors as follows:

| | Preferred name match | Synonym match |
|---|---|---|
| KeywordTokenizer index | 250x | 100x |
| StandardTokenizer index | 25x | 10x |

We also provide additional boosts for phrase matches, which raise synonym matches proportionally more (2x the standard boost) than preferred name matches (1.2x the standard boost):

| | Preferred name match | Synonym match |
|---|---|---|
| KeywordTokenizer index | 300x | 200x |
| StandardTokenizer index | 30x | 20x |

Finally, we multiply the total score by the base-10 logarithm of the number of identifiers in the clique plus one. This boost ranges from log(2) ≈ 0.3 for a clique that has only a single identifier to over log(1000) = 3 for cliques with a thousand or more identifiers.
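
The clique-size multiplier is simple enough to check by hand. The sketch below uses hypothetical function names and only illustrates the arithmetic described above.

```python
import math


def clique_size_boost(identifier_count):
    """Base-10 logarithm of (number of identifiers in the clique + 1)."""
    return math.log10(identifier_count + 1)


def final_score(total_score, identifier_count):
    """Scale the boosted TF*IDF total by the clique-size boost."""
    return total_score * clique_size_boost(identifier_count)


# The multiplier grows slowly with clique size.
for count in (1, 9, 99, 999):
    print(count, round(clique_size_boost(count), 2))
```

Note that a clique with a single identifier is scaled down (multiplier of about 0.3), and the multiplier only exceeds 1 once a clique has ten or more identifiers.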