Going Beyond SQuAD Part 1: Question Answering in Different Languages

Category: IT Technology · Published: 6 years ago

Human Annotated Data

In NLP today, the best data is still human-generated data. SQuAD is impressive in both its scale and the accuracy of its annotations, and many teams have tried to replicate its procedure. For example, SberQuAD (Russian) and FQuAD (French) are crowd-sourced QA datasets that have proven to be good starting points for building non-English QA systems. KorQuAD (Korean) also replicates the original SQuAD crowd-sourcing procedure and provides some very interesting insights into how trained QA systems fare against humans on different types of questions. The authors of FQuAD find that with CamemBERT (a BERT model pre-trained on French) and a dataset a quarter the size of the original SQuAD dataset, they can still reach approximately 95% of human F1 performance. The labour-intensive nature of native crowd-sourced data collection, however, limits the creation of large-scale datasets, and this has motivated many teams to investigate ways to automatically translate SQuAD.

Figure: Comparison of Exact Match (EM) performance on the KorQuAD dataset by type of question-answer pair (Lim et al., 2019)
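
For readers less familiar with the two metrics mentioned above, here is a minimal sketch of how Exact Match and token-level F1 are commonly computed for SQuAD-style answers. It is a simplified illustration, not the official evaluation script of any of the datasets discussed: the `normalize` helper, the Unicode-based punctuation stripping and the whitespace tokenization are assumptions, and whitespace tokenization in particular may not suit every language.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop Unicode punctuation and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only answers that match the gold span exactly after normalization, while F1 gives partial credit for overlapping tokens, which is why a model can sit well below humans on EM while being close on F1.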

Machine Translated Data

Machine-translated SQuAD datasets exist for Korean (K-QuAD), Italian, and Spanish. Machine translation is almost always more cost- and time-efficient than human annotation, especially considering the premium on crowd-sourcing native speakers of non-English languages on platforms such as Mechanical Turk. We at deepset have also experimented with machine translation of SQuAD and have faced the same quality assurance issues that confronted the creators of the aforementioned datasets.

Chief amongst these is the issue of alignment. Though translating the question and passage is straightforward, it is not always possible to automatically infer the answer span in the translated text, since the character indices will almost certainly have shifted. One remedy is to wrap the answer span in start and end markers before translation, in the hope that they survive it. It is also worth noting that the encoder-decoder attention components in modern machine translation models can function as a form of alignment. When the dataset is translated with full access to a trained model, the attention weights can be interpreted as a form of free alignment (cf. this method).
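
The marker technique can be sketched roughly as follows. This is an illustration under stated assumptions rather than the procedure used by any of the datasets above: the `[[` and `]]` markers, the function names and the `translate` callable are all hypothetical, and a real translation system may drop, move or alter the markers, which is why the recovery step can return `None`.

```python
from typing import Callable, Optional, Tuple

# Hypothetical markers; in practice they must be strings the translator leaves intact.
START, END = "[[", "]]"


def mark_answer(context: str, answer_start: int, answer_text: str) -> str:
    """Wrap the answer span of a SQuAD-style example in START/END markers."""
    answer_end = answer_start + len(answer_text)
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return context[:answer_start] + START + answer_text + END + context[answer_end:]


def recover_answer(translated: str) -> Optional[Tuple[str, str, int]]:
    """Return (clean_context, answer_text, answer_start), or None if markers were lost."""
    start = translated.find(START)
    if start == -1:
        return None
    end = translated.find(END, start + len(START))
    if end == -1:
        return None
    inner = translated[start + len(START):end]
    answer = inner.strip()
    if not answer:
        return None
    # Remove the markers but keep the text between them in place.
    clean = translated[:start] + inner + translated[end + len(END):]
    # Skip any whitespace the translator inserted right after the start marker.
    answer_start = start + (len(inner) - len(inner.lstrip()))
    return clean, answer, answer_start


def translate_example(
    context: str,
    answer_start: int,
    answer_text: str,
    translate: Callable[[str], str],
) -> Optional[Tuple[str, str, int]]:
    """Mark the answer, translate with the supplied callable, then recover the span."""
    marked = mark_answer(context, answer_start, answer_text)
    return recover_answer(translate(marked))
```

Examples for which `translate_example` returns `None` would have to be discarded or aligned by another method, such as the attention-based approach mentioned above.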

Finding the Right Mix

Given this trade-off between data quality and scale when choosing between human-created and machine-translated datasets, how can we ensure the best performance in our trained models? In the research literature, a few different teams leverage both kinds of datasets in different ways.

The creators of FQuAD, for example, have data in both styles and train three models: one using just the machine-translated data, one using just the human-annotated data, and one using both. Even though the machine-translated data adds another 46,000 samples on top of the 25,000 human-annotated ones, they find that a model trained on both performs slightly worse than one trained on the human-annotated data alone.

K-QuAD is also composed of a mix of machine- and human-created samples, and the researchers behind it experiment with combinations of the data. Ultimately, they find that a mixture of the human data and de-noised machine-translated data gives the best performance. Finally, the creators of the Arabic Question Answering dataset also experiment with a mixture of human- and machine-created samples, and for them the best performance comes from a full mixture of both.

From these data points, it seems fair to say that a dataset of around 25,000 human-annotated SQuAD-style samples is enough to train a model that reaches at least 90% of human performance. If you only have around 5,000 such samples, augmenting this set with machine-translated data may be worthwhile.
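
To make the idea of augmenting a small human-annotated set concrete, here is a minimal sketch of combining two datasets. It assumes both are already loaded as dictionaries in the SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas`); the function names, the `max_machine_articles` cap and the seeded shuffle are illustrative choices, not something taken from the papers discussed above.

```python
import random
from typing import Dict, Optional


def count_questions(dataset: Dict) -> int:
    """Count question-answer pairs in a SQuAD-style dataset dictionary."""
    return sum(
        len(paragraph["qas"])
        for article in dataset["data"]
        for paragraph in article["paragraphs"]
    )


def mix_datasets(
    human: Dict,
    machine: Dict,
    max_machine_articles: Optional[int] = None,
    seed: int = 0,
) -> Dict:
    """Return a new dataset with all human articles plus (a sample of) machine articles."""
    machine_articles = list(machine["data"])
    # Shuffle with a fixed seed so that any cap yields a reproducible sample.
    random.Random(seed).shuffle(machine_articles)
    if max_machine_articles is not None:
        machine_articles = machine_articles[:max_machine_articles]
    return {
        "version": human.get("version", "mixed"),
        "data": list(human["data"]) + machine_articles,
    }
```

Comparing `count_questions` before and after mixing makes it easy to try different ratios of human to machine-translated samples and to check which one performs best on a human-annotated dev set.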

Multilingual Datasets
