The reign of the English language over modern technology and the Internet may soon be at an end. Increasingly, local language technologies are emerging to challenge the role of English as the language of the web. Representing Urdu and other Pakistani languages at the forefront of this battle is the Centre for Research in Urdu Language Processing (CRULP). For Dr. Sarmad Hussain, founding director of CRULP, and his team, developing the capacity for local language processing is not merely an intellectual exercise in machine processing research but their contribution to a global struggle to give every person access to information regardless of the language they speak. Like the translators of Al-Mamun, the seventh Abbasid caliph, who translated many of the classics of Greek, Indian, Persian and Chinese scholarship and saved them from the ash-heap of history, the team at CRULP is working to bridge the disconnect between the wealth of knowledge available on the Internet and the large non-English speaking segment of Pakistani society. While this team may not have royal patronage like the Abbasid translators, who were paid in gold equal to the weight of the books they translated, the dissemination of knowledge and the legacy of scholarship that team CRULP leaves behind will be invaluable. I recently visited the CRULP headquarters at the National University of Computer and Emerging Sciences (NUCES), Lahore, where project manager Kiran Khurshid showed me around the CRULP lab and talked about the various projects currently in progress.
The overarching goal of CRULP is to develop local language processing technologies that give people easy access to information regardless of the local language they speak. The traditional approach to introducing technology into rural areas has involved providing schools and colleges with computers and expecting the locals to learn and adapt to modern technology. Dr. Hussain sees a fundamental flaw in this approach: it either fails to address or underestimates the two major barriers people face in using modern technology, illiteracy and language. With 45% of the population illiterate and most people unable to interact in English, it is impractical to expect them to use computers to access information through current technology. The team at CRULP aims to break the illiteracy barrier by developing Urdu speech recognition and text-to-speech systems that allow users to operate technology by voice. The language barrier is being tackled through the development of software in Urdu, such as the SeaMonkey internet suite, which provides users with Urdu-based tools to make websites, surf the internet, and use email.
When asked about the decline of Urdu and the increasing use of English in modern Pakistani society, Dr. Hussain replied that this phenomenon is limited in scope and is not representative of the general Pakistani population. While the urban elite may see a decline of Urdu in their immediate social circle, the truth is that the majority of the population still interacts solely in Urdu and other local languages. This can be seen most readily in the circulation of newspapers: the combined circulation of all the English dailies equals about 15% of the circulation of the Daily Jang (an Urdu daily) alone. Dr. Hussain also pointed out that even most of his peers, who have received higher education abroad and are among the most educated members of the population, read Urdu newspapers, another indication of the continued dominance of Urdu in Pakistani society. His experience with students in primary schools in rural areas of Pakistan further supports his claim. In one exercise, students were shown how to access the websites of an Urdu newspaper and an English newspaper, and then asked to retrieve information about an event that had occurred the day before. More than 99% of the students went to the Urdu website to retrieve the information. Furthermore, when asked to write essays about themselves in Urdu and English, most students wrote only a single paragraph in English but multiple pages in Urdu, which again shows that Urdu is still the language of choice for most Pakistanis.
Dr. Hussain’s love for Urdu and other local languages is plainly apparent as he talks about the importance of these languages. He understands that it is possible for English to become a universal language spoken by all, but points out that it would result in a huge cultural loss. The accents, oral histories, stories, songs, and poems are all key cultural components of a people and need to be preserved. CRULP is also playing a part in the preservation of local languages by documenting them. One example is Torwali, a language spoken by the people of Swat. Today only 60,000 people speak Torwali, making it an endangered language. The CRULP team is currently working with a scholar from the Swat region to prepare a Torwali dictionary, document Torwali grammar, and record Torwali literature such as poems.
The CRULP team is working to preserve not only the lexicon of the language, but also the beauty of the language itself. An example mentioned by Dr. Hussain concerns the numerous ways each Urdu letter can appear in Urdu text. A single letter can have up to forty different forms depending on where it appears within a word, line, and paragraph. From a software engineer’s point of view, it would be practical to amalgamate similar forms of the letter and treat them as a single case, but the team refuses to compromise on the beauty of the letters and treats each case differently. The justification of text also requires special consideration in Urdu calligraphy. While English fonts usually increase the spacing between words and letters to justify text, Urdu calligraphers have traditionally stretched the letters themselves for this purpose, and the fonts developed by CRULP use the same rule to justify Urdu text. Other projects in the pipeline include the use of text-to-speech and speech recognition systems to develop software geared towards blind users so that they too can use computer and internet tools.
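The difference between padding with spaces and stretching letters can be illustrated with a rough sketch. This is not CRULP's method: their fonts stretch the letter shapes inside the font itself, whereas the sketch below approximates the idea in plain text by inserting the Unicode tatweel (kashida) character, U+0640, after letters that connect to the next letter. The function names, the small table of letters that do not connect forward, and the use of character counts in place of real glyph widths are all assumptions made for illustration.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida): elongates the join between two letters

# Letters that do NOT connect to the letter that follows them.
# Illustrative and incomplete; a real system would use the Unicode joining data.
NON_FORWARD_JOINING = set(
    "\u0621\u0622\u0623\u0625\u0627"        # hamza and alef variants
    "\u062F\u0630\u0631\u0632\u0648\u0624"  # dal, thal, reh, zain, waw, waw with hamza
    "\u0688\u0691\u0698"                    # ddal, rreh, jeh
    "\u06D2\u06D3"                          # yeh barree, with and without hamza
)


def is_arabic_letter(ch):
    """True for base letters in the Arabic block (excludes marks, digits, tatweel)."""
    return "\u0600" <= ch <= "\u06FF" and unicodedata.category(ch) == "Lo"


def stretch_points(text):
    """Indexes i where a tatweel may be placed between text[i] and text[i + 1]."""
    return [
        i
        for i in range(len(text) - 1)
        if is_arabic_letter(text[i])
        and is_arabic_letter(text[i + 1])
        and text[i] not in NON_FORWARD_JOINING
    ]


def justify_by_stretching(line, width):
    """Lengthen `line` to `width` characters by elongating joins, not adding spaces.

    Width is counted in characters, a crude stand-in for measured glyph widths.
    """
    missing = width - len(line)
    points = stretch_points(line)
    if missing <= 0 or not points:
        return line
    extra = [0] * len(line)
    for k in range(missing):  # spread the elongations round-robin over the joins
        extra[points[k % len(points)]] += 1
    return "".join(ch + TATWEEL * n for ch, n in zip(line, extra))


if __name__ == "__main__":
    word = "\u06A9\u062A\u0627\u0628"  # "kitab" (book)
    print(justify_by_stretching(word, 7))
```

Removing every tatweel from the result gives back the original text, so the stretching changes only the appearance of the line and not its content.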
Even with the development of these local-language systems, the availability of content remains a major challenge. Until sufficient Urdu content is available in digital format, technology that makes local language material accessible is of limited use. A goal mentioned by many of the team members and by Dr. Hussain himself is the development of software that would take entire web pages and translate them into Urdu. Once this is done, the entire World Wide Web would be available to the Urdu-speaking population. This is a goal the team hopes to achieve in the future.
To conclude, here are highlights of three projects currently in progress at CRULP.
Project Dareecha: Introducing sustainable technology into rural schools
Dareecha is the Pakistani component of the PAN Localization project, a regional initiative funded by the IDRC (International Development Research Centre), Canada, which aims to develop sustainable technology for use in South and Southeast Asian countries by encouraging the development of local language computing. Phase 2 of the project, now in progress, focuses on the implementation of Urdu-based software in local areas. Under this effort, CRULP has established computer labs at 10 schools in the Sargodha district of Punjab. During a course consisting of three five-day training sessions, students from these schools were taught the basics of computing, word processing, internet browsing, website composition, email, and chatting, all using Urdu-based software. Some of the websites created by the students can be seen here and here (keep in mind that these were made by students who had never used computers before). With the training sessions complete, the evaluation of the students’ progress is now underway. The team is in touch with the students via email, and a summer website design competition is underway to encourage students to continue using the labs and software and to gauge the effectiveness of the different programs developed.
Adaptive English Language Teaching Tool (AELTT)
This project aims to develop thirty English language lessons to be used by 9th grade students. While such software exists, it mainly focuses on teaching English to European users, which makes it unsuitable for Pakistani users, who have difficulty understanding the accent of the voices used in such software and cannot relate to the given examples (for example, references to Christmas trees and bowling alleys). AELTT team members emphasized that the software developed by the team will use a voice with a marked Pakistani accent and will include only local names and references, so that students don’t have trouble understanding the lessons. Thus students will no longer have to learn English through stories about ‘Peter’ and ‘Jane’ playing ‘lacrosse’. The Institute of Education Research, Punjab University, is collaborating with CRULP on this project and has planned and created thirty lessons. The team hopes to test the AELTT by August, after which they hope to see it being used by primary schools across Pakistan.
Telephone-based Speech Interfaces for Access to Information by Non-literate Users
This project is being carried out in collaboration with the Language Technologies Institute (LTI) of Carnegie Mellon University and the Aga Khan University. It aims to develop a system that would enable health workers to access medical information by phone through speech recognition and text-to-speech. The system would make use of Pakistan’s extensive mobile phone network to give these workers access to information while out in the field. The text-to-speech software has been developed, and team lead Huda Sarfraz and her team are currently working to build a repository of audio files of Urdu words spoken by different people. Once this is completed, the technology can be used not only to build the medical information system, but also in future projects such as the system for blind users mentioned above.