India is building datasets in local languages to develop AI-based services and tools that non-English speakers struggle to access
For a few weeks this year, villagers in the southwestern Indian state of Karnataka read out dozens of sentences in their native Kannada language into an app as part of a project to build the country’s first AI-based chatbot for tuberculosis.
Kannada has more than 40 million native speakers in India. It is one of the country’s 22 official languages and among the more than 121 languages spoken by at least 10,000 people in the world’s most populous nation.
But few of these languages are covered by natural language processing (NLP), the branch of artificial intelligence that enables computers to understand text and spoken words.
Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.
“For AI tools to work for everyone, they need to also cater to people who don’t speak English or French or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India.
The villagers in Karnataka are among thousands of speakers of different Indian languages generating speech data for tech firm Karya, which is building datasets for firms such as Microsoft and Google to use in AI models for education, healthcare and other services.
The Indian government, which aims to deliver more services digitally, is also building language datasets through Bhashini, an AI-led language translation system that is creating open-source datasets in local languages to help develop AI tools.
Of the more than 7,000 living languages in the world, fewer than 100 are covered by major NLP systems, with English the most advanced.
Source: Context News