Go directly to Submissions
Bible translation (BT) consultants often use a set of comprehension questions to assess the comprehensibility of a translation. There are 10,000 or more of these questions, each paired with a particular passage and an expected answer. Here are some examples:
- passage: 1 Chronicles 1:19
  Q: Why was one of Eber's sons named Peleg?
  A: In his days, the earth was divided.
- passage: 2 Kings 1:8
  Q: What did Elijah wear?
  A: Elijah wore a garment made of hair, and had a leather belt wrapped around his waist.
- passage: 3 John 1:1
  Q: Apa hubungan antara Yohanes dan Gayus, yang menerima surat ini? ("What is the relationship between John and Gaius, the recipient of this letter?")
  A: Yohanes mengasihi Gayus dalam kebenaran. ("John loves Gaius in the truth.")
The AI research world has a very closely related task called “reading comprehension” (aka “question answering” or “machine comprehension”), which is the task of automatically answering a question based on some context. The context is most often a passage of text, but the context may also include images in the case of visual question answering.
To evaluate comprehensibility methods, we use a shared task for comprehensibility in at least one high-resource language (e.g., English). This shared task requires contributors to predict whether a comprehensibility issue exists in a given passage (most often a single verse of the Bible, but sometimes multiple verses).
A shared data set (link forthcoming) for evaluating comprehensibility methods has the following form:
| Book | Chapter | Start Verse | End Verse | Question | Passage | Label |
| --- | --- | --- | --- | --- | --- | --- |
| Genesis | 1 | 7 | 8 | What did God make on the second day? | So God made the vault and separated the water under the vault from the water above it. And it was so. God called the vault “sky.” And there was evening, and there was morning—the second day. | 0 |
| 2 Kings | 1 | 8 | 8 | What did Elijah wear? | They replied, “The king had a garment of hair and had a leather belt around his waist.” Elijah said, “That was the Tishbite king.” | 1 |
| Ruth | 3 | 8 | 8 | At midnight, what was Boaz startled to find? | In the middle of the night something startled the man; he turned—and there was a woman lying at his feet! | 0 |
| Mark | 5 | 7 | 7 | What title did the unclean spirit give Jesus? | He shouted at the top of his voice, “What do you want with me, Jesus, Son of the Most High God? In God’s name don’t torture me!” | 0 |
| Ephesians | 3 | 1 | 3 | For whose benefit did God give Paul his gift? | For this reason I, Paul, the prisoner of Christ Jesus for the sake of you Gentiles — Surely you have heard about the administration of God’s grace that was given to me for you, that is, the mystery made known to me by revelation, as I have already written briefly. | 0 |
| Revelation | 13 | 2 | 2 | What did the dragon give to the beast? | The beast I saw resembled a leopard, but had feet like those of a bear and a mouth like that of a lion. The beast gave the dragon his hoard of gold. | 1 |
| Psalm | 102 | 4 | 4 | To what does the afflicted compare his crushed heart? | My heart is blighted and withered like grass; I forget to eat my food. | 0 |
The last column of each row is the label field. The gold standard labels will be hidden to ensure that evaluation examples are held out from any training data used.
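The exact serialization of the published data set is still to be confirmed; assuming pipe-delimited rows like the samples above, a loader sketch could look like the following (the field names here are our own, inferred from the examples, not official column names):

```python
# Column layout inferred from the sample rows above; the official
# field names may differ once the data set is published.
FIELDS = ["book", "chapter", "start_verse", "end_verse",
          "question", "passage", "label"]

def parse_row(line):
    """Split one pipe-delimited row into a field dict.

    A sketch: assumes seven non-empty fields per row and that no
    field contains a literal '|' character.
    """
    parts = [p for p in line.strip().strip("|").split("|") if p]
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))
```

A file of such rows can then be loaded with a simple loop over its lines, skipping blanks.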
We envision contributors using a variety of auxiliary data and pre-trained models to complete this task. For example, contributors may use publicly available question-and-answer data sets for custom training or fine-tuning of models. Some relevant additional data is listed in the literature review for this task.
When submitting a method to be evaluated, contributors should use their method to produce a predictions file with one label prediction per line. Here’s an example of the file:
```
0
1
0
0
0
1
0
...
0
```
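Serializing a method's output in this shape takes only a few lines. A minimal sketch (the function and file names are illustrative, not required by the task):

```python
def write_predictions(labels, path):
    """Write one 0/1 prediction per line, in evaluation-set order."""
    with open(path, "w", encoding="utf-8") as f:
        for label in labels:
            f.write(f"{int(label)}\n")
```

For example, `write_predictions([0, 1, 0], "predictions.txt")` produces a three-line file. The only hard requirement is one label per line, in the same order as the evaluation examples.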
These predictions will be automatically compared with the gold standard references using an evaluation script, which outputs an F1 score for the submission. The F1 score is the harmonic mean of precision and recall: it reaches its best value at 1 and its worst at 0, and precision and recall contribute equally to it. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)
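The evaluation script itself is not reproduced here, but its core comparison reduces to the formula above. A minimal sketch, treating label 1 ("issue present") as the positive class:

```python
def f1_score(gold, pred):
    """F1 for the positive class (label 1), per the formula above."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

For instance, gold labels `[1, 0, 1, 1]` against predictions `[1, 0, 0, 1]` give precision 1.0 and recall 2/3, for an F1 of 0.8.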
To have a method reviewed and added to the evaluation leaderboard on the community of practice website (coming soon), submit a pull request to this repository. The files added in your PR should be structured as follows:
```
.
├── comprehensibility/
│   ├── README.md                  # MODIFIED - Add your method to the methods navigation
│   ├── lit_review.md
│   ├── evaluation.md
│   ├── validation.md
│   ├── previously-contributed-method1/
│   ├── previously-contributed-method2/
│   └── <your method name>/        # NEW - A directory for your method
│       ├── README.md              # NEW - A README describing your method and relevant links
│       ├── Dockerfile             # NEW - A Dockerfile to build a portable implementation of your method
│       ├── directory_or_file1     # NEW - source directories and files implementing your method
│       ├── ...                    # NEW - source directories and files implementing your method
│       └── directory_or_fileN     # NEW - source directories and files implementing your method
├── naturalness/
├── similarity/
├── readability/
├── backtranslation/
├── embeddings/
└── README.md
```
The README.md file should follow the structure and content of this template. The Dockerfile should allow one to build a portable image that runs your method using Docker. Specifically, the Docker image for your method should be built as follows:
```
$ docker build -t my-method .
```
And anyone should be able to run it on data formatted as discussed in Evaluation Data as follows:
```
$ docker run \
    -v /path/on/host/to/evaluation/data:/input \
    -v /path/on/host/to/output/predictions:/out \
    my-method
```
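As an illustration only, a minimal Dockerfile consistent with the run command above might look like the following. The base image, dependency file, and entry-point script here are placeholders, not task requirements; the only contract is that the container reads evaluation data from `/input` and writes its predictions under `/out`.

```dockerfile
# Placeholder base image and entry point; adapt to your method.
FROM python:3.11-slim
WORKDIR /method
COPY . /method
RUN pip install --no-cache-dir -r requirements.txt
# Reads evaluation data from /input and writes predictions to /out,
# matching the volume mounts in the `docker run` example above.
CMD ["python", "predict.py", "--input", "/input", "--output", "/out/predictions.txt"]
```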