GoURMET – Neural Machine Translation for low-resource language pairs and domains
Machine translation is an increasingly vital technology for a global and interconnected world. It works very well where millions of translated sentences are available for training models. For low-resource language pairs, however, translation quality is barely, if at all, usable. Our project focuses on collecting and creating low-resource language data, and on pushing forward the latest machine learning research to make the best use of the little data we have.
The media industry is one of the pillars of a functioning democracy, and it faces a growing range of threats, from political populism to social media content aggregators. Our project will help the media industry thrive by allowing it to reach a bigger audience with less effort. Our translation models will let journalists follow a broad spread of news from countries of interest, and produce content faster in local languages by leveraging the output of machine translation models.
GoURMET has three main goals:
- Advancing low-resource deep learning for natural language applications;
- Developing high-quality machine translation for low-resource language pairs and domains;
- Developing tools for media analysts and journalists.
The University of Edinburgh is coordinating the project. Together with the Universities of Alicante and Amsterdam, we are investigating how to make translation significantly more robust, starting from the intuition that translated corpora contain enormous redundancies and are therefore an inefficient way to learn to translate. Inspired by human learning, we are studying methods of building up meaning compositionally. We also leverage another human capacity, the ability to “learn to learn”, that is, to build on knowledge acquired in related tasks, by developing machine learning techniques such as transfer learning and data augmentation. These allow us to extract knowledge from monolingual and parallel resources in other languages and domains.
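One widely used data-augmentation technique in this setting is back-translation: monolingual text in the target language is translated back into the source language by a reverse model, producing synthetic parallel sentence pairs for training. The sketch below illustrates the pipeline only; `translate_to_source` is a hypothetical stand-in for a real target-to-source model, and the toy lexicon exists purely to make the example runnable.

```python
# Sketch of back-translation for data augmentation (illustrative only).
# A real system would plug a trained target->source translation model
# into `translate_to_source` in place of the toy stub below.

def translate_to_source(target_sentence: str) -> str:
    """Hypothetical stub: a real reverse model would go here."""
    toy_lexicon = {"habari": "news", "za": "of", "leo": "today"}
    return " ".join(toy_lexicon.get(w, w) for w in target_sentence.split())

def back_translate(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Turn monolingual target-language text into synthetic parallel data.

    Each target sentence is paired with a machine-generated source
    sentence, yielding extra (noisy) training pairs for the
    source->target direction.
    """
    return [(translate_to_source(t), t) for t in monolingual_target]

synthetic = back_translate(["habari za leo"])
```

The synthetic pairs are then mixed with the genuine parallel data when training the forward model, which is especially valuable when real parallel text is scarce.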
Our media partners, the BBC and Deutsche Welle, are world leaders in adopting and promoting cutting-edge technologies, and they have developed user interfaces and APIs for testing and deploying translation in the newsroom. Currently, we are testing translation models for Gujarati, Swahili, Turkish, and Bulgarian, and we will be delivering Tamil, Amharic, Kyrgyz, and Serbian systems soon. By the end of the project we will provide support for 17 language pairs and open-source tools for quickly adding new language coverage and evaluating the quality of the resulting translations.
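The project text does not specify its evaluation tooling, but automatic evaluation of machine translation commonly relies on metrics such as BLEU, which scores n-gram overlap between system output and human reference translations. The following is a minimal, self-contained sketch of corpus-level BLEU with a brevity penalty, assuming one reference per sentence and whitespace tokenisation; production systems would use an established implementation instead.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses: list[str], references: list[str], max_n: int = 4) -> float:
    """Corpus-level BLEU (0..1), one reference per hypothesis.

    Clipped n-gram matches are pooled over the whole corpus, the geometric
    mean of the n-gram precisions is taken, and a brevity penalty punishes
    hypotheses shorter than the references.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            # Counter '&' gives the clipped (per-sentence) match counts.
            matches[n - 1] += sum((ngrams(h, n) & ngrams(r, n)).values())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(matches) == 0 or min(totals) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)
```

A perfect match scores 1.0, and any missing higher-order n-gram overlap pulls the score below that, which makes BLEU a convenient regression check when adding a new language pair.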