Similarity ⚖️


Evaluation protocol for similarity methods

Similarity is a multi-faceted task, and the shared task will not initially address every facet. The first version will include two elements: entailment (natural language inference) and semantic similarity. Future versions should, in time, add lexical and syntactic similarity. Measures of pragmatic similarity are not currently considered, as there is little published NLP work in this area to build from, but we recognize that it is likely of interest to Bible translators (BTs) and suggest that it could be included in future versions if a well-formed task can be agreed upon. Possible future directions are discussed at the bottom of this section.

Entailment measures whether two passages are in logical agreement, in contradiction, or neutral. Semantic similarity measures the degree to which the words that make up two passages have the same meaning.

Both measurements are necessary. Passages like “Peter struck the high priest’s slave” and “the high priest’s slave struck Peter” may score as semantically very similar, since they contain the same words, describe the same act of violence, and involve the same actors. An entailment measure, however, shows that they are not in agreement.

Rubrics for Entailment and Semantics

For entailment we will use a simple scale of -1 for contradiction, 0 for neither entails nor contradicts, and 1 for entails. This is the same scale used for entailment datasets like XNLI, just shifted to be zero-centered rather than starting at zero.

For semantics we will use the same scale used by SemEval-2017 Task 1 (among others) scoring two statements on a scale from 0 (completely dissimilar) to 5 (completely equivalent). A few examples are given in the paper linked above, but the rubric is as follows:

| Score | Description |
| --- | --- |
| 0 | Completely dissimilar |
| 1 | Same general topic, but not equivalent |
| 2 | Not equivalent, but share some details |
| 3 | Roughly equivalent, important details differ or are missing |
| 4 | Mostly equivalent, unimportant details differ |
| 5 | Completely equivalent |

Evaluation Data

A shared dataset for similarity will be created with the following form:

  • LanguagePair - ISO 639-3 codes for each language, separated by a hyphen (e.g., eng-eng, eng-fra). Back translations should be listed as eng-eng, with the back translation noted in the references.
  • Reference - The reference passage to be compared
  • Translation - The translated passage, for comparison to the reference
  • EntailmentScore - As above, range -1 to 1
  • SemanticScore - As above, range 0 to 5

These additional fields will be included in a master file for record keeping, but not in the shared dataset:

  • ReferenceSource - A field indicating the source of the reference passage
  • TranslationSource - A field indicating the source of the translation passage

| LanguagePair | Reference | Translation | EntailmentScore | SemanticScore | ReferenceSource | TranslationSource |
| --- | --- | --- | --- | --- | --- | --- |
| eng-eng | So I disgraced the dignitaries of your temple; I consigned Jacob to destruction and Israel to scorn. | That is why I have not disgraced your priests; I have decreed that Jacob should increase and that Israel should be exalted. | -1 | 3 | Isaiah 43:28 NIV | curated GPT-3 generated contradiction |
| eng-eng | Then he said to the woman, “I will sharpen the pain of your pregnancy, and in pain you will give birth. And you will desire to control your husband, but he will rule over you.” | To the woman He said: “I will lessen your pain in childbirth; in pain you will bring forth children. Your desire will be for your husband, and he will be ruled by you.” | -1 | 3 | Genesis 3:16 NLT | curated GPT-3 generated contradiction |
| eng-eng | To the woman he said, “I will make your pains in childbearing very severe; with painful labor you will give birth to children. Your desire will be for your husband, and he will rule over you.” | After he said this, God told her, “I am giving you a heavy burden for the time when you are pregnant. Later, when you give birth to a child, you will get much pain. But you will desire your man and he will be your boss.” | 1 | 5 | Genesis 3:16 NIV | Genesis 3:16 Yongkom back translation |
| eng-eng | To the woman he said, “I will surely multiply your pain in childbearing; in pain you shall bring forth children. Your desire shall be contrary to your husband, but he shall rule over you.” | After he said this, God told her, “I am giving you a heavy burden for the time when you are pregnant. Later, when you give birth to a child, you will get much pain. But you will desire your man and he will be your boss.” | 1 | 4 | Genesis 3:16 ESV | Genesis 3:16 Yongkom back translation |
| eng-eng | God brought them out of Egypt; he hath as it were the strength of an unicorn. | God had nothing to do with their exodus. | -1 | 1 | Numbers 23:22 KJV | curated GPT-3 generated contradiction |
| eng-eng | For I know the plans I have for you, declares the LORD, plans for welfare and not for evil, to give you a future and a hope. | I know the plans I have for you, declares the LORD, plans for evil and not for welfare, to give you a future and a hope. | -1 | 3 | Jeremiah 29:11 ESV | curated GPT-3 generated contradiction |
| eng-eng | Blessed are they that hunger and thirst after righteousness: for they shall be filled. | Blessed are they that are hungry and thirsty for power: for they shall be filled. | 0 | 3 | Matthew 5:6 ASV | curated GPT-3 generated contradiction |
| eng-eng | God blesses those who hunger and thirst for justice, for they will be satisfied. | Whoever has big desire to live only righteously, God has blessed, because he will give them only these righteous intentions/purposes and they will become only sufficient/able. | 1 | 3 | Matthew 5:6 NLT | Matthew 5:6 Yongkom back translation |
| eng-eng | From the days of John the Baptist until now the kingdom of heaven suffers violence, and violent men take it by force. | The kingdom of heaven is a peaceful place. | -1 | 1 | Matthew 11:12 NASB | curated GPT-3 generated contradiction |
| eng-eng | From the days of John the Baptist until now the kingdom of heaven has suffered violence, and forceful people lay hold of it. | From the time that Yon Baptaiser lived coming to today, violent people are wanting to ruin/destroy and rule over the place of God becoming head and ruling over! | 1 | 4 | Matthew 11:12 NET | Matthew 11:12 Yongkom back translation |
| eng-eng | From the days of John the Baptist until now, the kingdom of heaven has been forcefully advancing, and forceful men lay hold of it | From the time that Yon Baptaiser lived coming to today, violent people are wanting to ruin/destroy and rule over the place of God becoming head and ruling over! | -1 | 3 | Matthew 11:12 NIV (1984) | Matthew 11:12 Yongkom back translation |


  • The scores here are assigned by someone with no experience as a translation consultant (me), and reasonable people can disagree. In practice, a “gold score” dataset includes only the data on which multiple scorers agree. To ensure consistent scoring on the issues we care about, we need to give volunteer scorers clear instructions on the rubric.
  • The examples here are in English, but we anticipate being able to provide a dataset in multiple high resource languages, as needed.
  • The intent is to provide new releases over time as the dataset grows and as new facets of similarity are added to the dataset.
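
As a rough illustration of the record layout described above, the following sketch validates one row of the shared dataset. The class name, the snake_case attribute names, and the regular expression are assumptions made for this example; the actual file format and field handling have not been specified here.

```python
import re
from dataclasses import dataclass

# Assumption: a language pair is two three-letter ISO 639-3 codes joined by a hyphen.
LANGUAGE_PAIR = re.compile(r"^[a-z]{3}-[a-z]{3}$")


@dataclass(frozen=True)
class SimilarityRecord:
    """One row of the shared dataset (hypothetical in-memory representation)."""

    language_pair: str
    reference: str
    translation: str
    entailment_score: int   # -1 contradiction, 0 neutral, 1 entails
    semantic_score: float   # 0 (completely dissimilar) to 5 (completely equivalent)

    def __post_init__(self):
        # Reject rows that fall outside the ranges given in the rubric.
        if not LANGUAGE_PAIR.match(self.language_pair):
            raise ValueError(f"bad language pair: {self.language_pair!r}")
        if self.entailment_score not in (-1, 0, 1):
            raise ValueError(f"bad entailment score: {self.entailment_score!r}")
        if not 0 <= self.semantic_score <= 5:
            raise ValueError(f"bad semantic score: {self.semantic_score!r}")
```

Checking ranges when a record is created is one way to catch scoring mistakes from volunteer scorers early, before rows reach a gold standard file.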

Auxiliary Data

Some sources of auxiliary data are provided in the Literature Review. If we are able to keep a similar scoring rubric, this data should be suitable for use as additional labelled data in training a model. If we depart from the existing rubrics, this data would likely still be useful in semi-supervised model architectures. Whenever auxiliary data is used, we ask that contributors disclose which auxiliary data they have used in the description of their model, with links to the data. If proprietary auxiliary data is used and cannot be accessed by all participants, these models will be annotated on the leaderboard.


Scoring

When submitting a method to be evaluated, contributors should use their method to produce a predictions file with two comma-separated label predictions per line, entailment first and semantics second. Here’s an example file:

-1, 4
0, 3
1, 5
-1, 2
0, 0

Participants wishing to submit a model that assesses only entailment or only semantics should leave the other field empty (or a single blank space), keeping the comma in place.

Entailment only predictions:

-1,
0,
1,
-1,
0,

Semantics only predictions:

, 4
, 3
, 5
, 2
, 0
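
A minimal sketch of how a line in this format might be read, assuming that an empty or blank field means "no prediction". The function name is hypothetical, and the real evaluation script may handle these cases differently.

```python
from typing import Optional, Tuple


def parse_prediction_line(line: str) -> Tuple[Optional[int], Optional[float]]:
    """Parse one 'entailment, semantics' line; a blank field becomes None."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected exactly one comma, got: {line!r}")
    entailment_text, semantic_text = parts

    # An empty field means the method does not predict that label.
    entailment = int(entailment_text) if entailment_text else None
    semantic = float(semantic_text) if semantic_text else None

    if entailment is not None and entailment not in (-1, 0, 1):
        raise ValueError(f"entailment must be -1, 0, or 1, got: {entailment}")
    if semantic is not None and not 0 <= semantic <= 5:
        raise ValueError(f"semantics must be between 0 and 5, got: {semantic}")
    return entailment, semantic
```

Keeping the comma in every line means the field position alone identifies which label a value belongs to, so the same parser covers full, entailment-only, and semantics-only files.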

These predictions will be automatically compared with the gold standard references using an evaluation script. In keeping with the standards of measurement used for similar datasets, we will use accuracy (correct predictions / total predictions) for measuring entailment and Pearson correlation for measuring semantic similarity.
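
The two metrics can be sketched in a few lines of plain Python. This is an illustration only, not the project's evaluation script; the function names are assumptions, and both functions expect predictions and gold labels as equal-length lists.

```python
import math
from typing import Sequence


def accuracy(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of entailment labels that match the gold standard exactly."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def pearson(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation between predicted and gold semantic scores."""
    if len(predicted) != len(gold) or len(gold) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(gold)
    mean_p = sum(predicted) / n
    mean_g = sum(gold) / n
    covariance = sum((p - mean_p) * (g - mean_g) for p, g in zip(predicted, gold))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        # The correlation is undefined when either series is constant.
        raise ValueError("Pearson correlation is undefined for constant input")
    return covariance / math.sqrt(var_p * var_g)
```

Note that Pearson correlation rewards predictions that rise and fall with the gold scores, even if they are offset or scaled, whereas accuracy only counts exact label matches.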

Submitting a method to be evaluated

To submit a method for review, such that it can be added to the evaluation leaderboard on the community of practice website, submit a pull request to this repository. The files added in your PR should be structured as follows:

├── comprehensibility/
├── naturalness/
├── similarity/
│   ├──                        # MODIFIED - Add your method to the methods navigation
│   ├──
│   ├──
│   ├──
│   ├── previously-contributed-method1/
│   ├── previously-contributed-method2/
│   └── <your method name>/              # NEW - A directory for your method
│       ├──                    # NEW - A README describing your method and relevant links
│       ├── Dockerfile                   # NEW - A Dockerfile to build a portable implementation of your method
│       ├── directory_or_file1           # NEW - source directories and files implementing your method
│       ├── ...                          # NEW - source directories and files implementing your method
│       └── directory_or_fileN           # NEW - source directories and files implementing your method
├── readability/
├── backtranslation/
├── embeddings/

Your README file should follow the structure and content of this template. The Dockerfile should allow one to build a portable image that runs your method using Docker. Specifically, the Docker image for your method should be built as follows:

$ docker build -t my-method .

And anyone should be able to run it on data formatted as discussed in Evaluation Data as follows:

$ docker run \
    -v /path/on/host/to/evaluation/data:/input \
    -v /path/on/host/to/output/predictions:/out \
    my-method

where a file (formatted as specified in Scoring) is output to /path/on/host/to/output/predictions on the host.

Submissions (i.e., PRs) will be reviewed by team members from the AQuA project to verify that the provided links contain sufficient detail for others to understand the technique and, ideally, build upon the results.

Guidelines for those contributing methods

  • Solve for shared task evaluation data - all submissions must be attempts to solve for the labels in the shared evaluation data. Do not enter submissions intended to glean information about labels in the Gold Standard Dataset.
  • Share how you solved it - you must share information in your PR about how the task was solved. You should link to or attach any corresponding experiment bundles (e.g., from ClearML), logs, wiki pages, code repositories, academic papers or blog posts.
  • Share who you are - you must identify yourself and your affiliated organization, so others can ask you about your approach. Anonymous submissions are not allowed.
  • (optional) Share your ideas on visualization/explainability - while we’ve asked you to solve the task, we’d appreciate your thoughts on how the decisions of your model are best explained. For example, it might feature attentional mechanisms to highlight key words that played a factor in the scoring decision. Consider different scales of review: by book of the Bible, by chapter/pericope, by verse.

Future versions

Lexical similarity

In a monolingual context, lexical similarity can identify differences in word choices, including potential misspellings or typographical errors. If two words or character strings are compared, the Damerau-Levenshtein (D-L) distance measures the minimum number of deletions, additions, substitutions, or transpositions that must be made to get from one character string to another. At the sentence level, we might consider how we could pair groups of words to minimize the total D-L distance across all pairings. This likely is not so complex as to require a machine learning solution, as something like Earth Mover’s Distance adapted to the idea of minimizing scores would probably work well enough. While these would be useful metrics, the results may not completely capture what we care about here. For instance, “don’t” has a D-L distance of one from “donut” and two from “do not”.
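
To make the “don’t” example concrete, here is a small sketch of the optimal string alignment distance, a commonly used restricted form of D-L distance in which no substring is edited more than once. The function name is an assumption for this example, and the unrestricted D-L distance can give smaller values on some inputs.

```python
def osa_distance(source: str, target: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the distance between source[:i] and target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of source[:i]
    for j in range(cols):
        dist[0][j] = j  # insert every character of target[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (
                i > 1
                and j > 1
                and source[i - 1] == target[j - 2]
                and source[i - 2] == target[j - 1]
            ):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]
```

Calling `osa_distance("don't", "donut")` and `osa_distance("don't", "do not")` reproduces the distances of one and two given above, which illustrates why a purely character-level measure can rank an unrelated word as closer than an equivalent phrase.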

And of course in a cross-lingual context, the idea of lexical similarity may seem unusual. It would need to be augmented by some form of correspondence, perhaps a measure of cross-lingual synonymy. Of course, if it were only looking for synonymy, that would just be another form of semantic similarity. Presumably, we could do a form of strict forward or backward machine translation and then perform a measure of lexical similarity (as above) with the actual translation. But this still exhibits the donut/don’t problem mentioned previously. We could use methods that combine graph-based methods with D-L distance to distinguish between things that are forms of the same word and things that are different words (see also: lexical entailment, which finds hypernyms, metonyms, etc.). But it feels like we are getting ahead of ourselves if we’re exploring methods to find the existence of something without a clear statement of the objective. There probably needs to be some discussion here on whether the idea of lexical similarity (which is useful in a monolingual context) is useful in a cross-lingual context. Alternatively, are there uses for monolingual similarity (perhaps in ensuring consistency in the translation) that could be useful but not in line with the similarity shared task?

Syntactic similarity

There is currently more literature on the identification of elements of syntax in unsupervised or supervised fashion on multilingual data than there is on syntactic similarity. When syntactic similarity is found, it is often used as a means of improving semantic similarity. Cross-lingual syntactic similarity, however, does appear to be of concern to Bible translators and consultants on its own, as syntactic similarity when it should not be present is often an indication of a translation that needs review. Dependency and constituency parsing each create trees that break up a sentence into labeled parts. We can devise a tree edit distance (similar to D-L distance) that counts the minimum number of tree branch swaps, label changes, label additions, and label subtractions to get from one detected syntax tree to another. In the context of translation, knowing the similarity score between the two languages alone may not be sufficient. Presumably the translator or consultant also wants to know if the syntax identified is expected or unexpected in its respective language. While this is likely to be considered part of the “naturalness” task, it also has bearing on the similarity task. Ultimately, the way these metrics are presented for user consumption may need to differ from the initial task categories. There may be some merit in measuring syntactic similarity as a distinct element from naturalness in organizing the tasks even if the measures are combined in their final presentation.
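
As a hedged sketch of what a tree edit distance could look like, the code below implements a simplified top-down variant: node labels can be changed at a cost of one, and whole subtrees can be inserted or deleted at a cost equal to their size. It is not the general tree edit distance, and it has no explicit branch-swap operation (a swap shows up as a deletion plus an insertion). The `Node` class and the labels used are assumptions for illustration, not the output of any particular parser.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A labeled node in a hypothetical syntax tree (labels only, no words)."""

    label: str
    children: Tuple["Node", ...] = ()


def size(tree: Node) -> int:
    """Number of nodes in the tree."""
    return 1 + sum(size(child) for child in tree.children)


def tree_distance(a: Node, b: Node) -> int:
    """Simplified top-down tree edit distance between two ordered trees."""
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)

    # dist[i][j]: cost of turning the first i children of a into the first j of b.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + size(a.children[i - 1])  # delete subtree
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + size(b.children[j - 1])  # insert subtree

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + size(a.children[i - 1]),
                dist[i][j - 1] + size(b.children[j - 1]),
                dist[i - 1][j - 1] + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + dist[m][n]
```

Under these costs, adding one leaf to a tree gives a distance of one, while reordering two sibling subtrees costs at least the size of the smaller subtree twice over. Whether such costs match what translators and consultants would consider a meaningful syntactic difference is an open question.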

Pragmatic and other forms of similarity

Elements of discourse analysis and pragmatics may be desirable things to understand for Bible translators and consultants. While there is little research in this area, if a standard can be devised and used to mark a dataset, we should be able to attempt this as a research effort in the future. Any translators or consultants with ideas on what types of similarity they would like to measure are encouraged to reach out and present examples that might constitute the basis for a dataset. It is possible that through the above measures, some of these types of similarity will be measured incidentally. The amount of information gained for each additional metric should be considered, especially in areas of new research.