
Academic Paper


Title: Computational Identification and Analysis of Complicated Sanskrit
Author: Subhash Chandra
Homepage: http://sanskrit.jnu.ac.in/rstudents/subhash.html
Institution: Centre for Development of Advanced Computing
Linguistic Field: Computational Linguistics; Morphology; Syntax; Translation
Subject Language: Sanskrit
Abstract: This paper presents a model for the computational identification and analysis of
complicated Sanskrit noun phrases (NPs; nominal morphology, or Sanskrit subanta-padas)
in Sanskrit text. Simple forms, which are strictly rule-governed and fall into regular
patterns, are not difficult to analyze. However, several complicated and ambiguous
forms pose a challenge for analyzers. The purpose of this paper is to put forth a
strategy and algorithm that enable any Sanskrit parser to recognize and analyze these
complicated NPs. Identification involves separating the NPs from verb phrases (VPs;
tinanta) by isolating verbs and indeclinables. Analysis involves splitting each NP into
its subconstituents: the base (praatipadik, PDK; any meaningful form of a word that is
neither a root nor a suffix) and the case-number marker (karaka-vacana-vibhakti, KVV).
Sanskrit is a heavily inflected language and depends on serial inflections on nouns
and verbs to communicate meaning. A fully inflected unit is called a pada (usable
word), which is either an NP or a VP. Identifying and analyzing these inflections is
therefore critical to any further processing of Sanskrit.

According to Paanini, there are 21 nominal inflectional suffixes (seven vibhaktis
in three numbers; 7 × 3 = 21), which are attached to the PDK according to the
category, gender, number, and end-character of the base. Some forms of Sanskrit NPs
can be very complicated for computational identification and analysis. For example,
ramaah, bhavati, gacCati, etc. can each be either a nominal or a verbal construction.
Pronominal forms pose another challenge, because most of their inflected forms cannot
easily be related to their bases morphologically; we may have to posit ad hoc rules
and processing to handle them. For example, ‘aham’ (first person singular), ‘tvam’
(second person singular), and ‘sah’ (third person singular pronoun) are NPs formed
from the bases ‘asmad’, ‘yusmad’, and ‘tad’ respectively by inflecting for nominative
singular, and ‘amu’ is formed from ‘adas’ by inflecting for nominative dual.
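
The 7 × 3 grid and the ad hoc handling of pronouns can be illustrated with a small sketch. This is not the system's actual implementation (which the abstract describes as Java-based); the English case names, the `SLOTS` grid, and the `analyze_pronoun` helper are assumptions introduced here for illustration, and the lookup table contains only the four forms cited above.

```python
from itertools import product

# Conventional English glosses for the seven vibhaktis (an assumption of this
# sketch; the abstract names only "seven vibhaktis and three numbers").
CASES = ["nominative", "accusative", "instrumental", "dative",
         "ablative", "genitive", "locative"]
NUMBERS = ["singular", "dual", "plural"]

# Every (case, number) pair is one inflectional slot: 7 x 3 = 21.
SLOTS = list(product(CASES, NUMBERS))

# Ad hoc lookup for pronominal forms that cannot be related to their bases
# by regular suffix stripping. Entries are the examples from the abstract.
IRREGULAR_PRONOUNS = {
    "aham": ("asmad", "nominative", "singular"),
    "tvam": ("yusmad", "nominative", "singular"),
    "sah": ("tad", "nominative", "singular"),
    "amu": ("adas", "nominative", "dual"),
}


def analyze_pronoun(form):
    """Return (base, case, number) for a listed pronoun form, else None."""
    return IRREGULAR_PRONOUNS.get(form)


if __name__ == "__main__":
    print(len(SLOTS))
    print(analyze_pronoun("aham"))
```

A table lookup of this kind sidesteps morphology entirely for the listed forms, which is one plausible reading of "ad hoc rules"; forms not in the table would fall through to regular analysis.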

To identify NPs in Sanskrit text, the system first identifies the non-NPs:
punctuation, avyayas (AV), and verbs. It then treats all remaining words as NPs and
sends them on for analysis. AVs and VPs are identified with the help of an AV database
and a VP database. We have stored around 524 AV forms commonly used in modern
Sanskrit, and about 500 commonly used verb roots with their forms for verb
recognition, giving around 90,000 verb forms stored in UTF-8 Unicode Devanagari
script. Thus the NPs in Sanskrit text are identified by a process of exclusion: after
the verbs and avyayas are identified by lexical pattern matching, the remaining words
in the text are labeled NPs.
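
A minimal sketch of identification by exclusion follows. It is illustrative only: the tiny `AVYAYAS` and `VERB_FORMS` sets stand in for the AV and VP databases, tokens are romanized rather than Devanagari, and the `label_tokens` function and its label names are hypothetical, not taken from the system.

```python
import string

# Tiny stand-ins for the AV and VP databases (assumed sample entries;
# the real system is described as storing ~524 AV forms and ~90,000 verb forms).
AVYAYAS = {"ca", "eva", "api"}
VERB_FORMS = {"gacCati", "bhavati"}

# ASCII punctuation plus the Devanagari danda and double danda.
PUNCTUATION = set(string.punctuation) | {"\u0964", "\u0965"}


def label_tokens(tokens):
    """Label each token PUNCT, AV, or VP by lookup; everything else is NP."""
    labels = []
    for token in tokens:
        if token in PUNCTUATION:
            labels.append((token, "PUNCT"))
        elif token in AVYAYAS:
            labels.append((token, "AV"))
        elif token in VERB_FORMS:
            labels.append((token, "VP"))
        else:
            # Process of exclusion: whatever is left is treated as an NP.
            labels.append((token, "NP"))
    return labels


if __name__ == "__main__":
    for token, label in label_tokens(["raamah", "vanam", "gacCati", "ca", "\u0964"]):
        print(token, label)
```

Note that a pure lookup like this labels an ambiguous form such as bhavati as a VP whenever it appears in the verb list; resolving such nominal/verbal ambiguity would need additional context that this sketch does not model.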

The system has some basic requirements for use: (1) Java installed, to support the
Java web server; (2) the Apache Tomcat 4.0 web server installed; (3) Baraha software,
or any other tool, for UTF-8 Unicode Devanagari input. A user whose machine lacks any
of these cannot use the system.

The present work is an attempt to process Sanskrit NP inflections by way of
Paanini’s rule system, an appropriate database, and an example base. The system
developed is an online system running on the Apache Tomcat platform, using Java
servlets, with MS SQL Server 2005 as the back end and JDBC for connectivity. The goal
is to simplify Sanskrit text for self-reading and understanding, and also for any
Machine (Aided) Translation (MAT) from Sanskrit to other languages.
Type: Individual Paper
Status: Completed
Venue: Allahabad, India
Publication Info: ICCS, Allahabad, Proceeding

