Go directly to Submissions
Similarity is a multi-faceted task, and we will not initially address all of those facets in our shared task. The first version of the shared task will include two elements: Entailment (Natural Language Inference) and Semantic similarity. Future versions of the shared task should, in time, address Lexical and Syntactic similarity in addition to the first two elements. Measures of pragmatic similarity are not currently considered, as there is little published NLP work in this area to build from, but we do recognize this as an area likely of interest to BTs and suggest that it could be included in future versions if a well-formed task can be agreed upon. Discussion of possible future directions appears at the bottom of this section.
Entailment measures whether two passages are in logical agreement, contradiction, or are neutral. Semantic similarity measures the degree to which the words that two passages are composed of have the same meaning.
Both measurements are necessary because passages like “Peter struck the high priest’s slave” and “the high priest’s slave struck Peter” may be found to largely mean the same thing semantically: they contain the same words, they both describe striking and violence, and the actors are the same. When we apply an entailment measure, however, the two are not in agreement.
For entailment we will use a simple scale of -1 for contradiction, 0 for neither entails nor contradicts, and 1 for entails. This is the same scale used for entailment datasets like XNLI, just shifted to be zero-centered rather than starting at zero.
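For concreteness, converting standard NLI class names to this zero-centered scale might look like the following sketch. The dictionary and function names here are ours, and the integer encodings of XNLI labels vary across distributions of the data, so we map by class name rather than by index:

```python
# Sketch: mapping standard NLI class names to this task's zero-centered scale.
XNLI_TO_SHARED_TASK = {
    "contradiction": -1,
    "neutral": 0,
    "entailment": 1,
}

def to_shared_task_label(nli_label: str) -> int:
    """Convert an NLI class name to this task's {-1, 0, 1} entailment scale."""
    return XNLI_TO_SHARED_TASK[nli_label]
```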
For semantics we will use the same scale used by SemEval-2017 Task 1 (among others), which scores two statements on a scale from 0 (completely dissimilar) to 5 (completely equivalent). A few examples are given in the paper linked above, but the rubric is as follows:

| Score | Description |
| --- | --- |
| 0 | Completely dissimilar |
| 1 | Same general topic, but not equivalent |
| 2 | Not equivalent, but share some details |
| 3 | Roughly equivalent, important details differ or are missing |
| 4 | Mostly equivalent, unimportant details differ |
| 5 | Completely equivalent |
A shared dataset for similarity will be created with the following form:
These additional fields will be included in a master file for record keeping, but not in the shared dataset:
| Languages | Text 1 | Text 2 | Entailment | Semantics | Reference | Source |
| --- | --- | --- | --- | --- | --- | --- |
| eng-eng | So I disgraced the dignitaries of your temple; I consigned Jacob to destruction and Israel to scorn. | That is why I have not disgraced your priests; I have decreed that Jacob should increase and that Israel should be exalted. | -1 | 3 | Isaiah 43:28 NIV | curated GPT-3 generated contradiction |
| eng-eng | Then he said to the woman, “I will sharpen the pain of your pregnancy, and in pain you will give birth. And you will desire to control your husband, but he will rule over you.” | To the woman He said: “I will lessen your pain in childbirth; in pain you will bring forth children. Your desire will be for your husband, and he will be ruled by you.” | -1 | 3 | Genesis 3:16 NLT | curated GPT-3 generated contradiction |
| eng-eng | To the woman he said, “I will make your pains in childbearing very severe; with painful labor you will give birth to children. Your desire will be for your husband, and he will rule over you.” | After he said this, God told her, “I am giving you a heavy burden for the time when you are pregnant. Later, when you give birth to a child, you will get much pain. But you will desire your man and he will be your boss.” | 1 | 5 | Genesis 3:16 NIV | Genesis 3:16 Yongkom back translation |
| eng-eng | To the woman he said, “I will surely multiply your pain in childbearing; in pain you shall bring forth children. Your desire shall be contrary to your husband, but he shall rule over you.” | After he said this, God told her, “I am giving you a heavy burden for the time when you are pregnant. Later, when you give birth to a child, you will get much pain. But you will desire your man and he will be your boss.” | 1 | 4 | Genesis 3:16 ESV | Genesis 3:16 Yongkom back translation |
| eng-eng | God brought them out of Egypt; he hath as it were the strength of an unicorn. | God had nothing to do with their exodus. | -1 | 1 | Numbers 23:22 KJV | curated GPT-3 generated contradiction |
| eng-eng | For I know the plans I have for you, declares the LORD, plans for welfare and not for evil, to give you a future and a hope. | I know the plans I have for you, declares the LORD, plans for evil and not for welfare, to give you a future and a hope. | -1 | 3 | Jeremiah 29:11 ESV | curated GPT-3 generated contradiction |
| eng-eng | Blessed are they that hunger and thirst after righteousness: for they shall be filled. | Blessed are they that are hungry and thirsty for power: for they shall be filled. | 0 | 3 | Matthew 5:6 ASV | curated GPT-3 generated contradiction |
| eng-eng | God blesses those who hunger and thirst for justice, for they will be satisfied. | Whoever has big desire to live only righteously, God has blessed, because he will give them only these righteous intentions/purposes and they will become only sufficient/able. | 1 | 3 | Matthew 5:6 NLT | Matthew 5:6 Yongkom back translation |
| eng-eng | From the days of John the Baptist until now the kingdom of heaven suffers violence, and violent men take it by force. | The kingdom of heaven is a peaceful place. | -1 | 1 | Matthew 11:12 NASB | curated GPT-3 generated contradiction |
| eng-eng | From the days of John the Baptist until now the kingdom of heaven has suffered violence, and forceful people lay hold of it. | From the time that Yon Baptaiser lived coming to today, violent people are wanting to ruin/destroy and rule over the place of God becoming head and ruling over! | 1 | 4 | Matthew 11:12 NET | Matthew 11:12 Yongkom back translation |
| eng-eng | From the days of John the Baptist until now, the kingdom of heaven has been forcefully advancing, and forceful men lay hold of it | From the time that Yon Baptaiser lived coming to today, violent people are wanting to ruin/destroy and rule over the place of God becoming head and ruling over! | -1 | 3 | Matthew 11:12 NIV (1984) | Matthew 11:12 Yongkom back translation |
Some sources of auxiliary data are provided in the Literature Review. If we are able to keep a similar scoring rubric, this data should be suitable for use as additional labelled data in training a model. If we depart from the existing rubrics, this data would likely still be useful in semi-supervised model architectures. Whenever auxiliary data is used, we ask that contributors disclose which auxiliary data they have used in the description of their model, with links to the data. If proprietary auxiliary data is used and cannot be accessed by all participants, these models will be annotated as such on the leaderboard.
When submitting a method to be evaluated, contributors should use their method to produce a predictions file with two comma-separated label predictions per line, with entailment first and semantics second. Here’s an example of a file:

```
-1, 4
0, 3
1, 5
-1, 2
0, 0
...
```
Participants wishing to submit a model that assesses only entailment or only semantics should leave the other field empty (or as a blank space), with the comma still present.
Entailment only predictions:
```
-1,
0,
1,
-1,
0,
...
```
Semantics only predictions:
```
, 4
, 3
, 5
, 2
, 0
...
```
These methods will be automatically compared with the gold standard references using an evaluation script. In keeping with the standards of measurement used for similar datasets, we will use Accuracy ( (True Positives + True Negatives)/Total) for measuring entailment and Pearson Correlation for measuring semantic similarity.
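As a sketch of what such an evaluation script might compute (this is illustrative, not the official script; the function names and parsing details are our assumptions based on the format described above):

```python
def parse_line(line):
    """Parse an 'entailment, semantics' line; either field may be blank."""
    ent, sem = (field.strip() for field in line.split(","))
    return (int(ent) if ent else None, int(sem) if sem else None)

def accuracy(predicted, gold):
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```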
To submit a method for review, such that it can be added to the evaluation leaderboard on the community of practice website, submit a pull request to this repository. The files added in your PR should be structured as follows:
```
.
├── comprehensibility/
├── naturalness/
├── similarity/
│   ├── README.md                  # MODIFIED - Add your method to the methods navigation
│   ├── lit_review.md
│   ├── evaluation.md
│   ├── validation.md
│   ├── previously-contributed-method1/
│   ├── previously-contributed-method2/
│   └── <your method name>/        # NEW - A directory for your method
│       ├── README.md              # NEW - A README describing your method and relevant links
│       ├── Dockerfile             # NEW - A Dockerfile to build a portable implementation of your method
│       ├── directory_or_file1     # NEW - source directories and files implementing your method
│       ├── ...                    # NEW - source directories and files implementing your method
│       └── directory_or_fileN     # NEW - source directories and files implementing your method
├── readability/
├── backtranslation/
├── embeddings/
└── README.md
```
The `README.md` file should follow the structure and content of this template. The `Dockerfile` should allow one to build a portable image that runs your method using Docker. Specifically, the Docker image for your method should be built as follows:

```
$ docker build -t my-method .
```
And anyone should be able to run it on data formatted as discussed in Evaluation Data as follows:
```
$ docker run \
    -v /path/on/host/to/evaluation/data:/input \
    -v /path/on/host/to/output/predictions:/out \
    my-method
```

where a file (formatted as specified in Scoring) is output to `/path/on/host/to/output/predictions` on the host.
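As a sketch of what such a `Dockerfile` might look like (the base image, `requirements.txt`, and the `predict.py` entrypoint are assumptions for illustration, not requirements; any image that reads from `/input` and writes to `/out` works):

```dockerfile
# Hypothetical example submission image.
FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# predict.py is assumed to read evaluation data from /input and write
# a predictions file (formatted as specified in Scoring) to /out.
ENTRYPOINT ["python", "predict.py", "--input", "/input", "--output", "/out"]
```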
Submissions (i.e., PRs) will be reviewed by team members from the AQuA project to verify that the provided links contain sufficient detail for others to understand the technique and, ideally, build upon the results.
In a monolingual context, lexical similarity can identify differences in word choices, including potential misspellings or typographical errors. If two words or character strings are compared, the Damerau-Levenshtein (D-L) distance measures the minimum number of deletions, additions, substitutions, or transpositions that must be made to get from one character string to the other. At the sentence level, we might consider how we could pair groups of words to minimize the total D-L distance across all pairings. This likely is not so complex as to require a machine learning solution, as something like Earth Mover’s Distance adapted to the idea of minimizing scores would probably work well enough. While these would be useful metrics, the results may not completely capture what we care about here. For instance, “don’t” has a D-L distance of one from “donut” and two from “do not”.
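To make the string-level measure concrete, here is a sketch of the restricted (optimal string alignment) variant of D-L distance, which allows only adjacent transpositions and is the most common implementation:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    minimum deletions, insertions, substitutions, and adjacent transpositions."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[m][n]
```

This reproduces the example above: `damerau_levenshtein("don't", "donut")` is 1 (one substitution) while `damerau_levenshtein("don't", "do not")` is 2 (one insertion plus one substitution).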
And of course, in a cross-lingual context, the idea of lexical similarity may seem unusual. It would need to be augmented by some form of correspondence, perhaps a measure of cross-lingual synonymy. Of course, if it were only looking for synonymy, that would just be another form of semantic similarity. Presumably, we could do a form of strict forward or backward machine translation and then perform a measure of lexical similarity (as above) with the actual translation. But this still exhibits the donut/don’t problem mentioned previously. We could use methods that combine graph-based methods with D-L distance to distinguish between things that are forms of the same word and things that are different words (see also: lexical entailment, which finds hypernyms, metonyms, etc.). But … it feels like we are getting ahead of ourselves if we’re exploring methods to detect something without a clear statement of the objective. There probably needs to be some discussion here on whether the idea of lexical similarity (which is useful in a monolingual context) is useful in a cross-lingual context. Alternatively, are there uses for monolingual similarity (perhaps in ensuring consistency in the translation) that could be useful but not in line with the similarity shared task?
There is currently more literature on the identification of elements of syntax in unsupervised or supervised fashion on multilingual data than there is on syntactic similarity. When syntactic similarity is found, it is often used as a means of improving semantic similarity. Cross-lingual syntactic similarity, however, does appear to be of concern to Bible translators and consultants on its own, as syntactic similarity when it should not be present is often an indication of a translation that needs review. Dependency and constituency parsing each create trees that break up a sentence into labeled parts. We can devise a tree edit distance (similar to D-L distance) that counts the minimum number of tree branch swaps, label changes, label additions, and label subtractions to get from one detected syntax tree to another. In the context of translation, knowing the similarity score between the two languages alone may not be sufficient. Presumably the translator or consultant also wants to know if the syntax identified is expected or unexpected in its respective language. While this is likely to be considered part of the “naturalness” task, it also has bearing on the similarity task. Ultimately, the way these metrics are presented for user consumption may need to differ from the initial task categories. There may be some merit in measuring syntactic similarity as a distinct element from naturalness in organizing the tasks even if the measures are combined in their final presentation.
Elements of discourse analysis and pragmatics may be desirable things to understand for Bible translators and consultants. While there is little research in this area, if a standard can be devised and used to mark a dataset, we should be able to attempt this as a research effort in the future. Any translators or consultants with ideas on what types of similarity they may like to measure here are encouraged to reach out and present examples that might constitute the basis for a dataset. It is possible that through the above measures, some of these types of similarity will be measured incidentally. The amount of information gained for each additional metric should be considered (especially) in areas of new research.