MTNT

MTNT: Machine Translation of Noisy Text

Introduction

Break your MT models with MTNT, the Testbed for Machine Translation of Noisy Text! MTNT is a collection of comments from the Reddit discussion website: English comments translated into French and Japanese, and French and Japanese comments translated into English. What sets this dataset apart is that the data consists of "noisy" text that exhibits typos, grammar errors, code switching and more. For more details, check out the paper.

Data

You can download the data here: MTNT.1.1.tar.gz (md5sum: 8ce1831ac584979ba8cdcd9d4be43e1d)
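
If you want to check the download against the checksum above before extracting, here is a minimal sketch using only the Python standard library. The function names are illustrative, and the archive is assumed to sit in the current directory under the name given above.

```python
import hashlib

# Checksum quoted above for MTNT.1.1.tar.gz.
EXPECTED_MD5 = "8ce1831ac584979ba8cdcd9d4be43e1d"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_is_intact(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected checksum."""
    return md5_of_file(path) == expected
```

For example, `archive_is_intact("MTNT.1.1.tar.gz")` should return `True` for an uncorrupted download.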

After extraction with tar xvzf MTNT.1.1.tar.gz, the MTNT folder should have the following structure:

MTNT
├── monolingual
│   ├── dev.en
│   ├── dev.fr
│   ├── dev.ja
│   ├── dev.tok.en
│   ├── dev.tok.fr
│   ├── dev.tok.ja
│   ├── train.en
│   ├── train.fr
│   ├── train.ja
│   ├── train.tok.en
│   ├── train.tok.fr
│   └── train.tok.ja
├── README.md
├── split_tsv.sh
├── test
│   ├── test.en-fr.tsv
│   ├── test.en-ja.tsv
│   ├── test.fr-en.tsv
│   └── test.ja-en.tsv
├── train
│   ├── train.en-fr.tsv
│   ├── train.en-ja.tsv
│   ├── train.fr-en.tsv
│   └── train.ja-en.tsv
└── valid
    ├── valid.en-fr.tsv
    ├── valid.en-ja.tsv
    ├── valid.fr-en.tsv
    └── valid.ja-en.tsv
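
One way to confirm that the extraction produced the layout above is to rebuild the list of expected paths and report any that are missing. This is only a sketch: the helper names are illustrative, and the path list is derived purely from the tree shown here.

```python
from pathlib import Path

LANGS = ["en", "fr", "ja"]
PAIRS = ["en-fr", "en-ja", "fr-en", "ja-en"]


def expected_files():
    """Relative paths of every file in the directory tree shown above."""
    paths = ["README.md", "split_tsv.sh"]
    # Monolingual data: raw and tokenized, dev and train, per language.
    for split in ["dev", "train"]:
        for lang in LANGS:
            paths.append(f"monolingual/{split}.{lang}")
            paths.append(f"monolingual/{split}.tok.{lang}")
    # Parallel data: one tsv per split and language pair.
    for split in ["test", "train", "valid"]:
        for pair in PAIRS:
            paths.append(f"{split}/{split}.{pair}.tsv")
    return paths


def missing_files(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [p for p in expected_files() if not (root / p).is_file()]
```

Calling `missing_files("MTNT")` after extraction should return an empty list.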

The monolingual data is distributed with and without tokenization, in raw text format. The parallel data is split into training, validation and test sets. Each tsv file contains 3 columns:

  • Comment ID
  • Source sentence
  • Target sentence

Some source sentences come from the same original comment; you can use the comment ID to group them together and leverage the contextual information.
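
Grouping by comment ID could look like the following sketch. It assumes each line holds the three tab-separated columns listed above with no quoting; `group_by_comment` is an illustrative name, not part of the released code.

```python
from collections import defaultdict


def group_by_comment(tsv_path):
    """Map each comment ID to its (source, target) pairs, in file order."""
    groups = defaultdict(list)
    with open(tsv_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Assumption: three tab-separated columns. Splitting at most
            # twice keeps any stray tab inside the target column.
            comment_id, source, target = line.split("\t", 2)
            groups[comment_id].append((source, target))
    return dict(groups)
```

For instance, `group_by_comment("MTNT/train/train.en-fr.tsv")` returns a dictionary whose values are the sentence pairs that share a comment, which you can concatenate to build document-level context.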

If you're only interested in the source and target sentences, you can run the split_tsv.sh script to split the files into source and target files.
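
If you would rather do the split from Python, the sketch below writes the second and third columns to separate files. It is not the contents of split_tsv.sh, whose exact behavior and output names are not described here; the function name and the output file names in the usage note are made up for illustration.

```python
def split_tsv(tsv_path, src_path, tgt_path):
    """Write column 2 to src_path and column 3 to tgt_path, line by line.

    Returns the number of sentence pairs written.
    """
    count = 0
    with open(tsv_path, encoding="utf-8") as tsv, \
            open(src_path, "w", encoding="utf-8") as src, \
            open(tgt_path, "w", encoding="utf-8") as tgt:
        for line in tsv:
            line = line.rstrip("\n")
            if not line:
                continue
            # Assumption: comment ID, source, target, separated by tabs.
            _comment_id, source, target = line.split("\t", 2)
            src.write(source + "\n")
            tgt.write(target + "\n")
            count += 1
    return count
```

A call such as `split_tsv("MTNT/valid/valid.fr-en.tsv", "valid.fr-en.fr", "valid.fr-en.en")` would produce two line-aligned files.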

I have made the data used for pretraining available here: clean-data-en-fr.tar.gz and clean-data-en-ja.tar.gz. This should save you some time if you want to reproduce the setting from the paper.

Examples

| Language pair | Source | Target |
|---|---|---|
| en-fr | Just got called into work tho so I won’t be in til tomorrow night | Mais on vient de m'appeler pour le travail donc je n'y serai pas avant demain soir |
| fr-en | je demande lazil politique pr janluk # Il ressuscitera ! | I demand political asylum for jean luc # He will resurrect! |
| en-ja | Sooooooo, he hasn’t had a day off in 36 years? | ということは、36年間一度も休まなかったの? |
| ja-en | もう「ネットの噂に反応する企業(団体)wwwwww」て時代じゃないんだよなあ | It's not like it's the era of "companies (organizations) reacting to online rumors hahahaha". |

Leaderboard

This table lists all published results on the MTNT test set. If you want to appear on this table, shoot an email to pmichel1[at]cs.cmu.edu (please include a link/copy of your paper and code).

| System | en-fr | fr-en | en-ja | ja-en |
|---|---|---|---|---|
| [Michel & Neubig, 2018] Base | 21.77 | 23.27 | 9.02 | 6.65 |
| [Michel & Neubig, 2018] Finetuned | 29.73 | 30.29 | 12.45 | 9.82 |

The BLEU scores should be computed according to the guidelines given in the paper: using sacreBLEU on the detokenized output and reference with intl tokenization. Specifically, run:

cat out.detok | sacrebleu --tokenize=intl ref.detok

where {out,ref}.detok are the detokenized output and reference.

For en-ja only, you should pre-segment both the Japanese output and the reference with KyTea before running sacreBLEU:

kytea -m /path/to/kytea/share/kytea/model.bin -notags out.detok > out.seg
kytea -m /path/to/kytea/share/kytea/model.bin -notags ref.detok > ref.seg
cat out.seg | sacrebleu --tokenize=intl ref.seg
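
If you drive evaluation from a script, the commands above can be wrapped with `subprocess`. The sketch below only rebuilds the command lines shown in this section; it assumes the `sacrebleu` and `kytea` executables are on your PATH, and all function names are illustrative.

```python
import subprocess


def sacrebleu_command(ref_path):
    """Command from the text above: sacrebleu --tokenize=intl <reference>."""
    return ["sacrebleu", "--tokenize=intl", str(ref_path)]


def kytea_command(model_path, in_path):
    """Command from the text above: kytea -m <model> -notags <input>."""
    return ["kytea", "-m", str(model_path), "-notags", str(in_path)]


def segment(model_path, in_path, seg_path):
    """Run KyTea on in_path and write the segmented text to seg_path."""
    with open(seg_path, "wb") as seg:
        subprocess.run(kytea_command(model_path, in_path), stdout=seg, check=True)


def score(out_path, ref_path):
    """Pipe the system output into sacreBLEU and return what it prints."""
    with open(out_path, "rb") as hyp:
        result = subprocess.run(
            sacrebleu_command(ref_path),
            stdin=hyp,
            stdout=subprocess.PIPE,
            check=True,
        )
    return result.stdout.decode("utf-8")
```

For en-fr, fr-en and ja-en you would call `score("out.detok", "ref.detok")` directly. For en-ja, call `segment` on both the output and the reference first, then `score("out.seg", "ref.seg")`.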

Code

The code to reproduce the collection process and the Machine Translation experiments is available on GitHub.

Citing

If you use this dataset or the associated code, please cite:

@InProceedings{michel2018mtnt,
  author    = {Michel, Paul  and  Neubig, Graham},
  title     = {MTNT: A Testbed for Machine Translation of Noisy Text},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing}
}

Contact

If you have any issues with the data, please contact pmichel1[at]cs.cmu.edu. For any questions regarding the code, please open an issue on GitHub.

License

This data is released under the terms of the Reddit API.