Editor's Note: Zia Syed was involved in the development of the tool at Google, working with Google India's Indic Transliteration team.

Google's New Urdu Transliteration Tool

The vast majority of Pakistanis using the web are familiar only with English keyboards. Creating content in Urdu script is a slow and frustrating experience: it requires either learning the Urdu keyboard layout, which is forced onto a keyboard designed for writing English, or using an on-screen keyboard, which is useful but limited by the speed at which one can click the mouse. As a result, online content in Urdu script has mostly been produced by a small number of bloggers and commercial websites. For most users, writing Urdu in Roman script (transliteration) has become the main way of writing Urdu on computers. Transliteration is the phonetic mapping of words written in one script (e.g., Arabic) to another script (e.g., Roman). For example, شکریہ transliterates to shukriya. While Roman transliteration may be adequate for many purposes (e.g., chatting), it leaves a lot to be desired for people who prefer to read and write the language in its original script.

Google recently launched an exciting solution that turns the transliteration problem on its head, reverse transliterating Roman script into Arabic script: http://www.google.com/transliterate/indic/Urdu.

Even though transliteration is much simpler than translation, a transliteration system must overcome several challenges. First, the source script may not let users represent the desired sounds, e.g., English has no equivalent for the sounds of ت or ڑ. Second, even where an equivalent-sounding letter exists, it may map to several letters in the target script, e.g., ‘s’ can map to س, ص, or ث. Vowels pose yet another challenge: an ‘a’ can map to a diacritical mark in Urdu, e.g., a zabar, and not show up in the script at all, or it can map to ‘ا’, ‘آ’, ‘ع’, or ‘ء’ and be visible. Lastly, not everyone writes Urdu in Roman script using the same convention, e.g., some people use ‘q’ for ‘ق’ while others use ‘k’ for both ‘ق’ and ‘ک’. A good transliteration system has to overcome these problems to be usable.
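
To make the ambiguity concrete, here is a minimal Python sketch that enumerates every Urdu spelling a Roman word could stand for. The mapping table and the function name candidate_spellings are hypothetical and cover only the letters mentioned above; this is an illustration of the problem, not a description of how Google's tool works.

```python
from itertools import product

# Hypothetical, deliberately tiny table: each Roman letter lists the Urdu
# letters it might stand for. "" means a short vowel (e.g. a zabar) that
# is not written in the script.
ROMAN_TO_URDU = {
    "s": ["س", "ص", "ث"],
    "k": ["ک", "ق"],
    "q": ["ق"],
    "a": ["", "ا", "آ", "ع", "ء"],
}


def candidate_spellings(roman_word):
    """Return every Urdu spelling the table allows for a Roman word."""
    # Letters missing from the table are passed through unchanged.
    options = [ROMAN_TO_URDU.get(ch, [ch]) for ch in roman_word.lower()]
    return ["".join(choice) for choice in product(*options)]


print(len(candidate_spellings("sa")))    # 15  (3 choices for 's' x 5 for 'a')
print(len(candidate_spellings("saka")))  # 150 (3 x 5 x 2 x 5)
```

Even with this toy table, a four-letter word yields 150 candidates, which is why a usable system needs some way to rank them.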

The Google service is not yet perfect, but it uses a combination of techniques to disambiguate between the many potential choices during transliteration. These techniques include using an Urdu dictionary to give more weight to valid Urdu words, hard-coding common words and pronouns, and applying machine learning to parallel texts in Roman and Urdu script to learn the common character sequence mappings. The tool converts words to Urdu script on the fly. Mistakes can be fixed either by pressing backspace after the last written word or by clicking on any word. To correct a mistake, users can choose from a list of alternatives or enter the word manually using an on-screen Urdu keyboard. Try writing Mustansar Husain Tarar and you will see that it does a fairly good job (the last word will need correction by pressing backspace). More detailed usage instructions can be found here.
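
The way these three techniques can work together is sketched below in Python. Everything in it is an assumption made for illustration: the word lists, the per-letter weights (hand-picked stand-ins for what a learned model might produce), the bonus value, and the function name rank_candidates. It is not Google's implementation.

```python
from itertools import product

# Hypothetical data, for illustration only.
COMMON_WORDS = {"hai": "ہے", "main": "میں"}   # hard-coded frequent words
DICTIONARY = {"کتاب", "شکریہ"}                # known valid Urdu words
MAPPING_WEIGHTS = {                           # stand-in for learned weights
    "k": {"ک": 0.7, "ق": 0.3},
    "i": {"": 0.6, "ی": 0.4},
    "t": {"ت": 0.6, "ٹ": 0.3, "ط": 0.1},
    "a": {"": 0.5, "ا": 0.4, "ع": 0.1},
    "b": {"ب": 1.0},
}
DICTIONARY_BONUS = 10.0  # extra weight given to valid dictionary words


def rank_candidates(roman_word):
    """Return Urdu candidates for a Roman word, best guess first."""
    word = roman_word.lower()
    # 1. Hard-coded common words bypass scoring entirely.
    if word in COMMON_WORDS:
        return [COMMON_WORDS[word]]
    # 2. Score every combination by the product of its mapping weights.
    options = [list(MAPPING_WEIGHTS.get(ch, {ch: 1.0}).items()) for ch in word]
    scores = {}
    for combo in product(*options):
        text = "".join(letter for letter, _ in combo)
        score = 1.0
        for _, weight in combo:
            score *= weight
        # 3. Boost candidates that are valid dictionary words.
        if text in DICTIONARY:
            score *= DICTIONARY_BONUS
        scores[text] = max(score, scores.get(text, 0.0))
    return sorted(scores, key=scores.get, reverse=True)


print(rank_candidates("kitab")[0])  # کتاب
print(rank_candidates("hai"))       # ['ہے']
```

The remaining entries of the ranked list correspond to the alternatives a user could pick from when the top guess is wrong.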

The service was launched as a Google Labs project, which means that it is experimental and will keep changing to improve its quality based on user feedback. It has already been well received by the online Urdu community and will hopefully contribute to a significant increase in the amount of online Urdu content.
