User talk:Xerox iitb

Minutes of the meeting 16-08-2010 for MT Group

In this meeting, we had a teleconference with Shourya and Meera from Xerox, India. The agenda for the meeting was as follows:
 * Indian Language (especially Hindi) to English Translation
 * Building Statistical dictionary from comparable corpora
 * Building parallel corpora using crowd sourcing
 * Study of translatability between a pair of languages
 * Using Xerox Incremental Parser in SMT
 * Building judicial corpus from legal documents

Regarding Crowd Sourcing: Xerox will provide technical support and expert advice, but the system needs to be set up and tested by students of IITB (Chirag). An open question is which type of crowd should be targeted for this task. For the legal domain, we can approach law schools and lawyers for their participation. For the generic domain, housewives can play a vital role in achieving the goal. This is a gradual process and will take about 1 year.

The XIP parser can be used to add linguistic information to the aligned parallel corpus in order to build a tree-aligned corpus. For Hindi corpora, a chunker can do the same task.

The question of why we need to work on the legal domain still requires an explanation.

Another interesting question addressed was why the development of a statistical dictionary from comparable corpora needs to be looked at and how it could be useful. Our end goal is to translate from one language to another. To translate a sentence, we essentially identify smaller parts in the source-language sentence, get their counterparts in the target language, and then put these pieces together. It is here that we use the statistical dictionary to get the corresponding smaller parts in the target language. Building such a dictionary needs a large corpus, and since a lot of information on the web (or elsewhere) is essentially duplicated in multiple languages, we can leverage these comparable (but not necessarily sentence-aligned) corpora to develop a dictionary from one language to another. If we have good algorithms at hand, they can make the task of building a dictionary for two languages much simpler. It was also made clear that this work is about developing generic algorithms for generating a probabilistic dictionary from comparable corpora, regardless of any specific domain. These algorithms, however, can later be used for domain-specific tasks.
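To make the idea of a probabilistic dictionary concrete, here is a minimal sketch in Python. It is not the algorithm discussed in the meeting (none was fixed); it assumes the simplest possible approach, counting how often a source word and a target word appear in the same pair of comparable documents and normalising those counts into P(target word | source word). The function names and the toy romanized Hindi/English documents are invented for illustration.

```python
from collections import Counter, defaultdict


def build_probabilistic_dictionary(doc_pairs):
    """Estimate P(target_word | source_word) from document-level co-occurrence.

    doc_pairs: iterable of (source_text, target_text) comparable documents.
    Assumption: whitespace tokenisation and plain co-occurrence counting.
    """
    cooc = defaultdict(Counter)
    for src_doc, tgt_doc in doc_pairs:
        src_words = set(src_doc.lower().split())
        tgt_words = set(tgt_doc.lower().split())
        # Every source word is credited with every target word
        # that occurs in the paired document.
        for s in src_words:
            for t in tgt_words:
                cooc[s][t] += 1

    dictionary = {}
    for s, counts in cooc.items():
        total = sum(counts.values())
        # Normalise counts so that the entries for each source word sum to 1.
        dictionary[s] = {t: c / total for t, c in counts.items()}
    return dictionary


def best_translations(dictionary, word, k=3):
    """Return the k most probable target words (ties broken alphabetically)."""
    entries = dictionary.get(word.lower(), {})
    return sorted(entries.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy comparable "documents", invented for illustration only.
doc_pairs = [
    ("paani thanda hai", "the water is cold"),
    ("paani saaf hai", "the water is clean"),
    ("doodh thanda hai", "the milk is cold"),
]

dictionary = build_probabilistic_dictionary(doc_pairs)
print(best_translations(dictionary, "paani"))
```

On data this small, frequent function words such as "the" and "is" score as high as the true translation, which is one reason a practical method would likely need a stronger association measure and far more data than this sketch uses.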

Tasks to be done before the 30th August meeting:

@kushal: Presentation on the state of resources and tools in India

@naveed: Presentation on Indian Language to English SMT

@chirag: Presentation on Crowd Sourcing and how it can be done, e.g. Mechanical Turk

@all: Read Prof. RMK Sinha's paper on tools and resources of Indian origin