The reign of the English language over modern technology and the Internet may soon be at an end. Increasingly, local language technologies are emerging to challenge the role of English as the language of the web. Representing Urdu and other Pakistani languages at the forefront of this battle is the Centre for Research in Urdu Language Processing (CRULP). For Dr. Sarmad Hussain, founding director of CRULP, and his team, developing the capacity for local language processing is not merely an intellectual exercise in machine processing research but their contribution to a global struggle to give every person access to information regardless of the language they speak. Like the translators of Al-Mamun, the seventh Abbasid caliph, who translated many of the classics of Greek, Indian, Persian and Chinese scholarship and saved them from the ash-heap of history, the team at CRULP is working to bridge the disconnect between the wealth of knowledge available on the Internet and the large non-English-speaking segment of Pakistani society. While this team may not have royal patronage like the Abbasid translators, who were paid in gold equal to the weight of the books they translated, the dissemination of knowledge and the legacy of scholarship that team CRULP leaves behind will be invaluable. I recently visited the CRULP headquarters at the National University of Computer and Emerging Sciences (NUCES), Lahore, where project manager Kiran Khurshid showed me around the CRULP lab and talked about the various projects currently in progress.

The overarching goal of CRULP is to develop local language processing technologies that give people easy access to information regardless of the local language they speak. The traditional approach to introducing technology into rural areas has been to provide schools and colleges with computers and expect the locals to learn and adapt to modern technology. Dr. Hussain sees a fundamental flaw in this approach: it either fails to address or underestimates the two major barriers people face in using modern technology, illiteracy and language. With 45% of the population illiterate and most people unable to interact in English, it is impractical to expect them to use computers to access information through current technology. The team at CRULP aims to break the illiteracy barrier by developing Urdu speech recognition and text-to-speech systems that allow users to operate technology by voice. The language barrier is being tackled through the development of software in Urdu, such as a localized version of the SeaMonkey internet suite, which gives users Urdu-based tools to make websites, surf the internet, send email, and so on.

A snapshot of the SeaMonkey browser

When asked about the decline of Urdu and the increasing use of English in modern Pakistani society, Dr. Hussain replied that this phenomenon is limited in scope and is not representative of the general Pakistani population. While the urban elite may see a decline of Urdu in their immediate social circle, the truth is that the majority of the population still interacts solely in Urdu and other local languages. This can be seen most readily in the circulation of newspapers: the combined circulation of all the English dailies equals about 15% of the circulation of the Daily Jang (an Urdu daily) alone. Dr. Hussain also pointed out that even most of his peers, who have received higher education abroad and are among the most educated members of the population, read Urdu newspapers, another indication of the continued dominance of Urdu in Pakistani society. His experience with students in primary schools in rural areas of Pakistan further supports his claim. In one exercise, students were shown how to access the websites of an Urdu newspaper and an English newspaper, and then asked to retrieve information about a certain event that had occurred the day before. More than 99% of the students went to the Urdu website to retrieve the information. Furthermore, when asked to write essays about themselves in Urdu and in English, most students wrote only a single paragraph in English but multiple pages in Urdu, which again shows that Urdu is still the language of choice for most Pakistanis.

Dr. Hussain’s love for Urdu and other local languages is plainly apparent as he talks about the importance of these languages. He understands that it is possible for English to become a universal language spoken by all, but points out that this would result in a huge cultural loss. The accents, oral histories, stories, songs, and poems are all key cultural components of a people and need to be preserved. CRULP is also playing a part in the preservation of local languages by documenting them. One example is Torwali, a language spoken by the people of Swat. Today only 60,000 people speak Torwali, making it an endangered language. The CRULP team is currently working with a scholar from the Swat region to prepare a Torwali dictionary, document Torwali grammar, and record Torwali literature such as poems.

The CRULP team is working to preserve not only the lexicon of the language but also the beauty of the language itself. An example mentioned by Dr. Hussain concerns the numerous ways each Urdu letter can appear in Urdu text. A single letter can have up to forty different forms depending on where it appears within a word, line, and paragraph. From a software engineer’s point of view, it would be practical to amalgamate similar forms of a letter and treat them as a single case, but the team refuses to compromise on the beauty of the letters and treats each case differently. The justification of text also requires special consideration in Urdu calligraphy. While English fonts usually increase the spacing between letters to justify text, Urdu calligraphers have traditionally stretched the letters themselves for this purpose, and the fonts developed by CRULP use the same rule to justify Urdu text. Other projects in the pipeline include the use of text-to-speech and speech recognition systems to develop software geared towards blind users, so that they too can use computer and internet tools.
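To make the idea of position-dependent letter shapes concrete, here is a minimal Python sketch. It is not CRULP's method: it only reads the legacy Arabic Presentation Forms-B block of Unicode through the standard `unicodedata` module, which records at most four positional shapes per letter (isolated, initial, medial, final). The much richer set of Nastaliq shapes described above lives inside the font itself and is not captured here. The function name `positional_forms` is a hypothetical helper chosen for illustration.

```python
import unicodedata
from collections import defaultdict

# The four positional tags Unicode uses in compatibility decompositions.
POSITION_TAGS = ("<isolated>", "<initial>", "<medial>", "<final>")


def positional_forms(start=0xFE70, end=0xFEFF):
    """Group presentation-form code points by their base letter.

    Scans the Arabic Presentation Forms-B block and returns a mapping
    base letter -> {position name -> presentation-form character}.
    """
    forms = defaultdict(dict)
    for code_point in range(start, end + 1):
        # Decompositions look like "<initial> 0628"; unassigned code
        # points return an empty string and are skipped.
        parts = unicodedata.decomposition(chr(code_point)).split()
        # Keep only forms that map to a single base letter (skip ligatures).
        if len(parts) == 2 and parts[0] in POSITION_TAGS:
            base_letter = chr(int(parts[1], 16))
            position = parts[0].strip("<>")
            forms[base_letter][position] = chr(code_point)
    return forms


if __name__ == "__main__":
    forms = positional_forms()
    beh = "\u0628"  # ARABIC LETTER BEH, also used in Urdu
    for position, shape in sorted(forms[beh].items()):
        print(position, unicodedata.name(shape))
```

Even in this simplified view, a joining letter such as beh has four distinct shapes, while a non-joining letter such as alef has only two. A calligraphic Nastaliq font has to go well beyond this, which is why treating each case separately is so much work.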

Even with the development of these local-language systems, the availability of content remains a major challenge. Until sufficient Urdu content is available in digital format, technology that makes local-language material accessible is of limited use. A goal mentioned by many of the team members and by Dr. Hussain himself is the development of software that would take entire web pages and translate them into Urdu. Once this is done, the entire World Wide Web would be available to the Urdu-speaking population. This is a goal the team hopes to achieve in the future.

To conclude, here are highlights of three projects currently in progress at CRULP.

Project Dareecha: Introducing sustainable technology into rural schools

Dareecha is the Pakistani component of the PAN Localization project, a regional initiative funded by the IDRC (International Development Research Centre), Canada, which aims to develop sustainable technology for use in South and South-East Asian countries by encouraging the development of local language computing. Phase 2 of the project is currently in progress and focuses on the implementation of Urdu-based software in local areas. Under this effort, CRULP has established computer labs at 10 schools in the Sargodha district of Punjab. During a course consisting of three five-day training sessions, students from these schools were taught the basics of computing, word processing, internet browsing, website composition, email, and chatting, all using Urdu-based software. Some of the websites created by the students can be seen here and here (keep in mind that these were made by students who had never used computers before). With the training sessions complete, the evaluation of the students’ progress is now underway. The team is in touch with the students via email, and a summer website design competition is underway to encourage students to continue using the labs and software and to gauge the effectiveness of the different programs developed.

Students in a training session in Maluwala

Students using their newly learned email skills to stay in touch with the CRULP team

Adaptive English Language Teaching Tool (AELTT)

This project aims to develop thirty English language lessons to be used by 9th grade students. While such software exists, it mainly focuses on teaching English to European users, which makes it unsuitable for Pakistani users, who have difficulty understanding the accent of the voices used in such software and cannot relate to the given examples (for example, references to Christmas trees and bowling alleys). AELTT team members emphasized that the software developed by the team will use a voice with a marked Pakistani accent and will include only local names and references, so that students don’t have trouble understanding the lessons. Thus students will no longer have to learn English through stories about ‘Peter’ and ‘Jane’ playing ‘lacrosse’. The Institute of Education Research, Punjab University, is collaborating with CRULP on this project and has planned and created the thirty lessons. The team hopes to test the AELTT by August, after which they hope to see it being used by schools across Pakistan.

Telephone-based Speech Interfaces for Access to Information by Non-literate Users

This project is being carried out in collaboration with the Language Technologies Institute (LTI) of Carnegie Mellon University and the Aga Khan University. It aims to develop a system that would enable health workers to access medical information by phone through speech recognition and text-to-speech. This would make use of Pakistan’s extensive mobile phone network to give these workers access to information while out in the field. The text-to-speech software has been developed, and team lead Huda Sarfraz and her team are currently working to build a repository of audio files of Urdu words spoken by different people. Once this is completed, the technology can be used not only to build the medical information system but also in future projects such as the system for blind users mentioned above.

9 Responses to “Research Highlight: Wiring Urdu to the 21st century”

  1. I remember the day when we were planning to have the First National Urdu Language Processing Workshop. Dr. Sarmad, Bilal Hashmi, others and I would spend countless days and nights in the office, juggling between our daily work and this event. There were ups and downs, when we would wonder whether there would be funding for the event or not, etc. But the event happened and, well, it was just brilliant how it led to my batch being the first one to start different projects in CRULP. The rest is history! It surely gives me a great sense of pride that I was a part of this project’s foundations. Still far more things to achieve – but if anything, this country has always shown its capacity to deliver on many fronts. :)

    Cheers!
    Khurram Mir

  2. Bilal Munir says:

    Happy to see that my institute is doing a great job. Hats off to Dr. Hussain.

    -Bilal

  3. Aamir Berni says:

    I think developing Urdu software is more difficult (and no less valuable) than developing the nuclear bomb. If the nuclear bomb provided protection of our existence, then Urdu computing will provide us intellectual independence & protection. I can’t recall one advanced country that depends on English (as a foreign language). Only backward people ignore their mother tongues and as we can’t become rich by borrowing more loans from IMF, we can’t advance in science by borrowing someone else’s language. It is a shame that Urdu is considered merely for its poetic & religious heritage & beauty. I mean whenever I read about Urdu on Internet, they are just talking about poetry & religion. It is nice, thus, to see CRULP doing serious business in Urdu and for Urdu. I hope next time I will be able to write my comments here in Urdu :). Thanks.

  4. Aamir Berni says:

    P.S. Isn’t our 60+ year history of imposing English proof that it has failed? When will our government remove English from the list of official languages of Pakistan and accept the fact that Urdu is here to stay? Thanks.

  5. Abdul Jabbar Khan says:

    If the nuclear bomb provided protection of our existence, then Urdu computing will provide us intellectual independence & protection. I can’t recall one advanced country that depends on English (as a foreign language). Only backward people ignore their mother tongues and as we can’t become rich by borrowing more loans from IMF, we can’t advance in science by borrowing someone else’s language. It is a shame that Urdu is considered merely for its poetic & religious heritage & beauty. I mean whenever I read about Urdu on Internet, they are just talking about poetry & religion. It is nice, thus, to see CRULP doing serious business in Urdu and for Urdu. I hope next time I will be able to write my comments here in Urdu :). Thanks.

  6. Wa alaikum assalam.
    You may not find it in open source, but apart from that, research on Urdu and its fonts is under way in various places. In this connection, Urdu fonts have already been made in some places, and Urdu and its various keyboards have been made available in Windows. For details on Urdu keyboards and on using Urdu in Windows, you can visit this site:

    http://crulp.org

  7. Zayd says:

    Urdu Nastaliq is now available in open source, under the name Nafees Nastaleeq (on the CRULP website). But Urdu will not become popular on the Internet until all the Urdu articles, blogs, etc. on the Internet are in Nastaliq. The Internet editions of Urdu newspapers such as Jang are popular precisely because these newspapers are in Nastaliq Urdu. If only these people would now use the open-source Nastaliq font instead of images. I hope CRULP is researching this as well and that this dream of ours will soon come true.

  8. Anonymous says:

    MS-URDU EDITOR

  9. HmdarD says:

    MashaAllah, good research work is under way. But so far we have not been able to build even a single Urdu word processing program, so how will we build web design software? The one InPage that exists was made by our brothers across the border. In all of Pakistan, one percent of people have an original licensed copy of InPage; 99 percent use a pirated InPage.
    If only everyone would at least agree on a single keyboard layout! Right now in Pakistan everyone is making their own keyboard layout. I can touch-type Urdu on 3 different keyboard layouts: 1) the XP Urdu layout, 2) InPage phonetic, 3) XP phonetic. I think there are more than 10 layouts. Everyone uses the layout they prefer. CRULP should release a single prototype phonetic layout, and all the others should be banned.
    In truth, no one is devoted to Pakistan and its national language. The one or two who are do not have the resources.
