A Sentence Analyzer and Viewer for Detecting Grammatical Errors

Sponsored by the Computing Research Association's Committee on the Status of Women in Computing Research (CRA-W)




A Sentence Analyzer for Detecting Grammatical Errors

Perpetual Amoah

Rose Lupiana

Dr. Lila Ghemri (advisor)

Texas Southern University 

Abstract

We present a Sentence Analyzer for detecting and identifying grammatical errors in a sentence. The system is intended for users learning English as a second language and for young learners trying to master English grammar. It assesses their skill level based on the grammatical correctness of the sentences they write, and it relies on Natural Language Processing (NLP) tools and techniques to analyze the learner’s input.

In this paper, we discuss natural language processing techniques and their use in language teaching tools, and we present an overview of our system, its implementation, and possible extensions.

Natural Language Processing and Language Teaching tools

Natural languages are the languages that people use to communicate with one another, such as English and French. Many natural language processing tasks involve analyzing texts of varying sizes, ranging from a single sentence to a paragraph. The stages of natural language processing are tokenization, lexical analysis, syntactic analysis, and semantic analysis. We discuss only the first three stages, as the last one is not relevant to our system.

Tokenization

Tokenization can be defined as the process of mapping a sentence from a character string into a sequence of words. For example, tokenizing the sentence “Rose kisses John” yields [<Rose>, <kisses>, <John>], in which each word is delimited.
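As an illustration, the mapping from a character string to a sequence of words can be sketched in a few lines. This is a minimal sketch in Python, not the authors' Prolog code; the function name `tokenize` and the choice to keep punctuation marks as separate tokens are assumptions made for the example.

```python
import re

def tokenize(sentence):
    """Split a character string into a list of word tokens.

    Assumption for this sketch: punctuation marks become separate tokens,
    and apostrophes inside a word (as in "don't") stay with the word.
    """
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

print(tokenize("Rose kisses John"))  # ['Rose', 'kisses', 'John']
```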

Lexical Analysis

Lexical analysis verifies that all the words in the sentence the user enters belong to the lexicon. In simple examples and small systems, we can list all the words allowed by the system; these are usually contained in a module called the dictionary. However, large-vocabulary systems face a problem in representing the lexicon: not only is there a large number of words, but prefixes and suffixes can also be present. One way to address this problem is to preprocess the input sentence into a sequence of morphemes.
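The idea of checking words against a dictionary of stems, rather than listing every inflected form, can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' implementation: the toy `LEXICON`, the `SUFFIXES` list, and the function `analyze` are all hypothetical, and real morphological analysis needs many more rules than simple suffix stripping.

```python
# Hypothetical toy dictionary: each stem is mapped to its lexical category.
LEXICON = {"rose": "noun", "john": "noun", "apple": "noun",
           "kiss": "verb", "an": "article"}

# Hypothetical suffix list, longest first so that "es" is tried before "s".
SUFFIXES = ["es", "s"]

def analyze(word):
    """Return (stem, suffix, category), or None if the word is not recognized."""
    w = word.lower()
    if w in LEXICON:
        return (w, "", LEXICON[w])
    for suffix in SUFFIXES:
        if w.endswith(suffix):
            stem = w[:-len(suffix)]
            if stem in LEXICON:
                return (stem, suffix, LEXICON[stem])
    return None  # lexical analysis fails for this word

print(analyze("kisses"))  # ('kiss', 'es', 'verb')
print(analyze("kises"))   # None
```

Storing stems plus a short suffix list keeps the dictionary small, at the cost of accepting some forms that are not real words.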

Misspelled words

During sentence processing, lexical analysis may fail if the program encounters a word that is not in the dictionary. The problem can have two causes: either the word is a valid word of the language that has not been included in the dictionary, or the word has been misspelled and thus cannot be recognized. Systems used for language teaching usually include a subcomponent that suggests corrections for unknown words (either misspelled or not in the dictionary). Two methods of correction were studied during this project:

  • Longest-common subsequence problem

The longest common subsequence (LCS) is an efficient basis for finding the closest match to an unknown word. The method ranks candidate words by the length of the subsequence of characters that they share with the unknown word and selects the highest-ranked candidates.
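A minimal Python sketch of this ranking is given below. It uses the standard dynamic-programming computation of the LCS length; the function names `lcs_length` and `suggest`, the tie-breaking rule (closest length, then alphabetical order), and the sample word list are assumptions for illustration and are not taken from the authors' system.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)          # row for the empty prefix of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def suggest(unknown, dictionary, limit=3):
    """Rank dictionary words by LCS length with the unknown word.

    Ties are broken by closeness in length, then alphabetically
    (an assumption made for this sketch).
    """
    ranked = sorted(
        dictionary,
        key=lambda w: (-lcs_length(unknown, w), abs(len(w) - len(unknown)), w),
    )
    return ranked[:limit]

print(suggest("kises", ["kiss", "kisses", "rose", "john"], limit=2))
# ['kisses', 'kiss']
```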

  • Error-tolerant Finite-state Recognition

Error-tolerant recognition enables the recognition of strings that deviate slightly from any string in the regular set recognized by the underlying finite-state recognizer. This technique has applications in error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval.
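The following Python sketch illustrates the idea on a small scale. It is a simplification under stated assumptions: the recognizer is a trie (a deterministic finite-state recognizer for a finite word list), the deviation measure is plain Levenshtein edit distance, and the names `build_trie` and `tolerant_match` are hypothetical. The search walks the recognizer depth-first and abandons any path whose edit distance can no longer fall within the allowed threshold.

```python
END = "$"  # end-of-word marker; assumes "$" never occurs inside a word

def build_trie(words):
    """Build a trie whose accepting states store the recognized word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root

def tolerant_match(trie, query, max_errors=1):
    """Return (word, distance) pairs within max_errors edits of query."""
    results = []

    def walk(node, row):
        # row[j] is the edit distance between the path so far and query[:j].
        if END in node and row[-1] <= max_errors:
            results.append((node[END], row[-1]))
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            # Prune: no extension of this path can get below min(new_row).
            if min(new_row) <= max_errors:
                walk(child, new_row)

    walk(trie, list(range(len(query) + 1)))
    return sorted(results, key=lambda pair: (pair[1], pair[0]))

trie = build_trie(["kiss", "kisses", "rose", "john"])
print(tolerant_match(trie, "kises"))  # [('kiss', 1), ('kisses', 1)]
```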

 

Syntactic Analysis

The purpose of syntactic analysis is to determine the structure of the input text. This structure consists of a hierarchy of phrases, the smallest of which are the basic symbols and the largest of which is the sentence. The basic symbols are represented by leaf nodes and other phrases by interior nodes. The root of the tree represents the sentence.

Syntactic analysis is performed by a parser. A parser is a program that receives tokenized items, labels each item with its lexical category (noun, verb, adjective, and so on), and then uses grammar rules to build a syntactic tree, also called a parse tree.
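To make the tree structure concrete, here is a small recursive-descent parser sketched in Python. It is not the authors' Prolog parser: the toy grammar (S -> NP VP, NP -> proper noun | determiner noun, VP -> verb | verb NP), the `LEXICON` entries, and the function names are assumptions chosen for the example. Leaf nodes are (category, word) pairs and interior nodes are tuples labeled with the phrase name.

```python
# Hypothetical toy lexicon: word -> lexical category.
LEXICON = {"rose": "pn", "john": "pn", "the": "det", "an": "det",
           "apple": "n", "kisses": "v", "eats": "v"}

def parse_np(tagged, i):
    """NP -> pn | det n. Return (subtree, next index) or None."""
    if i < len(tagged) and tagged[i][0] == "pn":
        return ("np", tagged[i]), i + 1
    if (i + 1 < len(tagged) and tagged[i][0] == "det"
            and tagged[i + 1][0] == "n"):
        return ("np", tagged[i], tagged[i + 1]), i + 2
    return None

def parse_vp(tagged, i):
    """VP -> v np | v. Return (subtree, next index) or None."""
    if i < len(tagged) and tagged[i][0] == "v":
        obj = parse_np(tagged, i + 1)
        if obj is not None:
            np_tree, j = obj
            return ("vp", tagged[i], np_tree), j
        return ("vp", tagged[i]), i + 1
    return None

def parse_sentence(tokens):
    """S -> np vp. Return the parse tree, or None if parsing fails."""
    try:
        # Label each token with its lexical category.
        tagged = [(LEXICON[t.lower()], t) for t in tokens]
    except KeyError:
        return None  # a word is missing from the lexicon
    subject = parse_np(tagged, 0)
    if subject is None:
        return None
    np_tree, i = subject
    predicate = parse_vp(tagged, i)
    if predicate is None:
        return None
    vp_tree, j = predicate
    # The whole input must be consumed for the sentence to be accepted.
    return ("s", np_tree, vp_tree) if j == len(tagged) else None

print(parse_sentence(["Rose", "kisses", "John"]))
# ('s', ('np', ('pn', 'Rose')), ('vp', ('v', 'kisses'), ('np', ('pn', 'John'))))
```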

Grammatical Error Detector

In English, agreement is required in every sentence at the clause level and at the noun-phrase level. At the clause level, agreement is between subjects and verbs. At the noun-phrase level, agreement is between articles and the nouns they modify. Language learners in the early stages commonly make agreement errors.

System Overview:

Our system is implemented in Prolog, a logic programming language with a built-in mechanism for parsing grammars written in the Definite Clause Grammar (DCG) formalism. DCGs are an extension of context-free grammars that have proven useful for describing natural and formal languages, and that can be conveniently expressed and executed in Prolog. Conditions on the grammaticality of sentences are interleaved with the parser rules, and agreement checking applies once the appropriate phrases have been parsed. When it detects an error, the system displays messages that help the student identify and fix it.

 
Implementation

Our system processes sentences entered by the learner in the following stages:

  1. Parses the learner’s input.
  2. Diagnoses agreement errors, if any. The system detects two types: Subject-Verb agreement errors and Article-Noun agreement errors.
  3. Responds to each error with remedial feedback.

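The following are two typical examples, each pairing a correct form with an erroneous one:

  1. Clause level: Rose kisses John (correct) vs. Rose kiss John (Subject-Verb agreement error)
  2. Phrase level: an apple (correct) vs. an apples (Article-Noun agreement error)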
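The three stages can be sketched as follows. The authors' system is written in Prolog with DCG rules; this Python sketch only mirrors the idea of carrying a number feature through the parse and checking it once the relevant phrases are recognized. The toy `LEXICON`, the feedback wording, and the function names are all hypothetical, and treating the base verb form as "plural" is a simplification.

```python
# Hypothetical toy lexicon: word -> (category, number).
# "any" marks a word that agrees with both singular and plural.
LEXICON = {
    "rose": ("pn", "sg"), "john": ("pn", "sg"),
    "an": ("det", "sg"), "the": ("det", "any"),
    "apple": ("n", "sg"), "apples": ("n", "pl"),
    "kisses": ("v", "sg"), "kiss": ("v", "pl"),
    "eats": ("v", "sg"), "eat": ("v", "pl"),
}
UNPARSED = "The sentence cannot be parsed."

def agree(x, y):
    return x == "any" or y == "any" or x == y

def noun_phrase(words, i, errors):
    """NP -> pn | det n. Return (number, next index) or None."""
    if i < len(words):
        cat, num = LEXICON[words[i]]
        if cat == "pn":
            return num, i + 1
        if cat == "det" and i + 1 < len(words):
            n_cat, n_num = LEXICON[words[i + 1]]
            if n_cat == "n":
                if not agree(num, n_num):  # phrase-level check
                    errors.append(f"Article-noun agreement: '{words[i]}' "
                                  f"does not agree with '{words[i + 1]}'.")
                return n_num, i + 2
    return None

def diagnose(sentence):
    """Parse S -> NP V [NP] and return a list of feedback messages."""
    words = sentence.lower().split()
    if any(w not in LEXICON for w in words):
        return ["Unknown word: the sentence cannot be analyzed."]
    errors = []
    subject = noun_phrase(words, 0, errors)
    if subject is None:
        return [UNPARSED]
    subj_num, i = subject
    if i >= len(words) or LEXICON[words[i]][0] != "v":
        return [UNPARSED]
    if not agree(subj_num, LEXICON[words[i]][1]):  # clause-level check
        errors.append(f"Subject-verb agreement: the verb '{words[i]}' "
                      f"does not agree with its subject.")
    i += 1
    if i < len(words):
        obj = noun_phrase(words, i, errors)
        if obj is None or obj[1] != len(words):
            return [UNPARSED]
    return errors

for s in ["Rose kisses John", "Rose kiss John", "Rose eats an apples"]:
    print(s, "->", diagnose(s) or "no agreement errors found")
```

An empty list means no agreement error was found; otherwise each message names the level at which agreement failed, which is the kind of remedial feedback the third stage describes.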

Conclusion and Possible extensions

We have presented a system that correctly identifies agreement errors in English sentences and displays helpful messages to learners. Possible extensions include:

  • Adding WordNet as a source dictionary to increase the word list and the range of sentences that the system can process.
  • Adding rules to the parser so that it can handle word-order errors.
  • Improving messages to the user by suggesting corrections.
  • Implementing spelling-correction algorithms.

 


 

 

Publications:

 

Conference attended: Texas Southern University Research week 2006 (Received 2nd place oral presentation award)

 

http://itscience.tsu.edu/ghemri/CREU/CREU.htm