# MTNT: Machine Translation of Noisy Text

## Introduction
Break your MT models with MTNT, the Testbed for Machine Translation of Noisy Text! MTNT is a collection of comments from the Reddit discussion website in English, French and Japanese, translated to and from English. What sets this dataset apart is that it consists of "noisy" text, exhibiting typos, grammatical errors, code switching and more. For more details, check out the paper.
## Changelog
- NEW (05/16/2019): The mtnt2019 test set from the WMT 2019 robustness shared task is now available to download from this link: MTNT2019.tar.gz.
- 01/22/2019: MTNT has been updated to v1.1 to include a fix to a small formatting problem found by Matt Post (thanks!), see the issue on github for more info. The data itself hasn't changed.
## Data
You can download the data here: MTNT.1.1.tar.gz (md5sum: `8ce1831ac584979ba8cdcd9d4be43e1d`)
After extraction with `tar xvzf MTNT.1.1.tar.gz`, the `MTNT` folder should have the following structure:
```
MTNT
├── monolingual
│   ├── dev.en
│   ├── dev.fr
│   ├── dev.ja
│   ├── dev.tok.en
│   ├── dev.tok.fr
│   ├── dev.tok.ja
│   ├── train.en
│   ├── train.fr
│   ├── train.ja
│   ├── train.tok.en
│   ├── train.tok.fr
│   └── train.tok.ja
├── README.md
├── split_tsv.sh
├── test
│   ├── test.en-fr.tsv
│   ├── test.en-ja.tsv
│   ├── test.fr-en.tsv
│   └── test.ja-en.tsv
├── train
│   ├── train.en-fr.tsv
│   ├── train.en-ja.tsv
│   ├── train.fr-en.tsv
│   └── train.ja-en.tsv
└── valid
    ├── valid.en-fr.tsv
    ├── valid.en-ja.tsv
    ├── valid.fr-en.tsv
    └── valid.ja-en.tsv
```
The monolingual data is distributed in raw text format, both with and without tokenization. The parallel data is split into training, validation and test sets. Each `.tsv` file contains 3 columns:
- Comment ID
- Source sentence
- Target sentence
Some source sentences come from the same original comment, and you can use the comment ID to group them together and leverage the contextual information.
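As an illustration, sentences sharing a comment ID can be grouped with a short shell snippet. The file name and contents below are a toy stand-in for the 3-column format described above, not part of the released data:

```shell
# Toy 3-column TSV: comment ID, source sentence, target sentence (tab-separated).
printf '1\tfoo\tfu\n1\tbar\tbarre\n2\tbaz\tbasse\n' > ctx.tsv

# Concatenate the source sentences (column 2) of each comment ID.
awk -F'\t' '
  { ctx[$1] = (ctx[$1] == "" ? $2 : ctx[$1] " " $2) }
  END { for (id in ctx) print id, ctx[id] }
' ctx.tsv | sort > grouped.txt
```

Here `grouped.txt` ends up with one line per comment ID, e.g. `1 foo bar`, so all source sentences of a comment can be fed to a context-aware model together.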
If you're only interested in the source and target sentences, you can run the `split_tsv.sh` script to split the files into source and target files.
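If you'd rather do the split by hand, the core of the operation is just extracting columns 2 and 3 with `cut`. This is a minimal sketch on a toy file; the file names are illustrative and the released `split_tsv.sh` may differ in its details:

```shell
# Toy stand-in for e.g. valid/valid.fr-en.tsv (comment ID, source, target).
printf '123\tBonjour\tHello\n456\tAu revoir\tGoodbye\n' > sample.tsv

cut -f2 sample.tsv > sample.src   # source sentences only
cut -f3 sample.tsv > sample.tgt   # target sentences only
```

The resulting `.src`/`.tgt` files are line-aligned, which is the format most MT toolkits expect.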
I have made the data used for pretraining available here: clean-data-en-fr.tar.gz and clean-data-en-ja.tar.gz. This should save you some time if you want to reproduce the setting from the paper.
## Examples
| Language pair | Source | Target |
|---------------|--------|--------|
| en-fr | Just got called into work tho so I won’t be in til tomorrow night | Mais on vient de m'appeler pour le travail donc je n'y serai pas avant demain soir |
| fr-en | je demande lazil politique pr janluk # Il ressuscitera ! | I demand political asylum for jean luc # He will resurrect! |
| en-ja | Sooooooo, he hasn’t had a day off in 36 years? | ということは、36年間一度も休まなかったの? |
| ja-en | もう「ネットの噂に反応する企業(団体)wwwwww」て時代じゃないんだよなあ | It's not like it's the era of "companies (organizations) reacting to online rumors hahahaha". |
## Leaderboard
This table lists all published results on the MTNT test set. If you want to appear on this table, shoot an email to pmichel1[at]cs.cmu.edu (please include a link to or a copy of your paper and code).
| System | en-fr | fr-en | en-ja | ja-en |
|--------|-------|-------|-------|-------|
| [Michel & Neubig, 2018] Base | 21.77 | 23.27 | 9.02 | 6.65 |
| [Michel & Neubig, 2018] Finetuned | 29.73 | 30.29 | 12.45 | 9.82 |
The BLEU scores should be computed according to the guidelines given in the paper: using sacreBLEU on the detokenized output and reference with `intl` tokenization. Precisely, run:

```shell
cat out.detok | sacrebleu --tokenize=intl ref.detok
```

where `{out,ref}.detok` are the detokenized output and reference.
In the case of `en-ja` only, you should pre-segment the Japanese output with KyTea before running sacreBLEU:

```shell
kytea -m /path/to/kytea/share/kytea/model.bin -notags out.detok > out.seg
kytea -m /path/to/kytea/share/kytea/model.bin -notags ref.detok > ref.seg
cat out.seg | sacrebleu --tokenize=intl ref.seg
```
## Code
The code to reproduce the collection process and the machine translation experiments is available on GitHub.
## Citing
If you use this dataset or the associated code, please cite:
```bibtex
@InProceedings{michel2018mtnt,
  author    = {Michel, Paul and Neubig, Graham},
  title     = {MTNT: A Testbed for Machine Translation of Noisy Text},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  year      = {2018}
}
```
## Contact
If you have any issue with the data, please contact pmichel1[at]cs.cmu.edu. For any question regarding the code, please open an issue on GitHub.
## License
This data is released under the terms of the Reddit API.