ORGANIZATION OF DECO

2016-02-01

OVERVIEW

 

  

 

The Korean Electronic Dictionary System DECO (Jeesun Nam 2010, 2012, 2015) is a complex linguistic resource indispensable for the automatic analysis of Korean text.



It is fully compatible with the UNITEX platform (Sebastien Paumier 2003), which can be used not only to analyze Korean texts but also to construct diverse linguistic resources.



 

The DECO (Dictionnaire Electronique du COreen) Korean electronic dictionaries consist of two major types of resources:

 

 

TWO BASIC MODULES 

  

1. LEMMAS 

  

Lexical elements stored in a tabular-format database.

 

The lemmas are first classified into 4 categories, and each category is then subdivided into several subclasses. The 4 basic categories are NS (Simple Noun), VS (Simple Verb), AS (Simple Adjective) and DS (Simple Adverb), and every lemma entry must be assigned to one of them.

 

This module is currently composed of around 270,000 canonical forms (Version 3.0/2017) divided into 197,000 nouns (NS), 49,000 verbs (VS), 9,000 adjectives (AS) and 15,000 adverbs (DS).  

 

In this way, every lemma is assigned one of these 4 main categories and one inflectional class number. In addition to this mandatory information, several types of optional information (morphological, syntactic, semantic or sentiment-related) are attributed to the lemmas.
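The structure of a lemma entry described above can be pictured with a minimal Python sketch. The class name `LemmaEntry`, its field names and the class number `1` are assumptions made for illustration only; they do not reproduce the actual DECO table layout. The tag `ZNZ` (simple common noun) is the only value taken from this document.

```python
from dataclasses import dataclass, field

# The four top-level categories named in the text.
CATEGORIES = {
    "NS": "Simple Noun",
    "VS": "Simple Verb",
    "AS": "Simple Adjective",
    "DS": "Simple Adverb",
}


@dataclass
class LemmaEntry:
    """Hypothetical record for one canonical form (not the real DECO schema)."""
    lemma: str                  # canonical form
    category: str               # mandatory: one of NS, VS, AS, DS
    inflection_class: int       # mandatory: inflectional class number
    tags: list = field(default_factory=list)  # optional information tags

    def __post_init__(self):
        # Reject entries that lack one of the four mandatory categories.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


# Illustrative entry; the class number 1 is a placeholder, not a DECO value.
entry = LemmaEntry("학교", "NS", 1, ["ZNZ"])
```

The two mandatory fields are positional and validated, while the optional tags default to an empty list, mirroring the mandatory/optional split in the text.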

 

 

2. INFLECTIONAL SUFFIXES

 

Grammatical elements constructed in the form of FSTs (Finite-State Transducers).

 

Korean inflectional suffixes are extremely rich and complex in type, number and mutual combination. Their combinatorial relations are therefore represented as Directed Acyclic Graphs, which the UNITEX platform automatically transforms into FSTs.
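To make the idea of a suffix graph concrete, here is a toy Python sketch that enumerates every suffix sequence spelled by a path through a small Directed Acyclic Graph. The node names, the graph itself and the choice of suffixes (들, 에게, 는) are illustrative assumptions, not DECO data. The sketch only lists accepted sequences; it does not model the output side of a transducer or the UNITEX graph format.

```python
# Toy DAG: each node maps to a list of (suffix, next_node) edges.
# An empty suffix means "skip this slot". Illustrative only, not DECO data.
SUFFIX_DAG = {
    "start": [("들", "plural"), ("에게", "case"), ("", "end")],
    "plural": [("에게", "case"), ("", "end")],
    "case": [("는", "end"), ("", "end")],
    "end": [],
}


def enumerate_sequences(dag, node="start", prefix=""):
    """Yield every suffix string spelled by a path from `node` to 'end'."""
    if node == "end":
        yield prefix
        return
    # The graph is acyclic, so this recursion always terminates.
    for suffix, next_node in dag[node]:
        yield from enumerate_sequences(dag, next_node, prefix + suffix)


sequences = sorted(set(enumerate_sequences(SUFFIX_DAG)))
```

Even this three-slot graph yields six distinct sequences (including the empty one), which hints at why the combinations are stored as a graph rather than as a flat list.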

 

 

 

DECO-TAGSETS

 

Regarding lexical elements, the category NS consists of Simple Nouns, Numerals, Pronouns, Determiners, Dependent Nouns & Classifiers, and Modifying Nouns on the one hand, and Derived and Compound Nouns, Proper Nouns, Borrowed Nouns (i.e. transliterations of English words) and Corpus-based Colloquial Nouns on the other. The former group is relatively closed, whereas the latter is open, since its members keep increasing in number in current Korean texts.

  

The category VS consists of Simple Verbs as well as Complex Verbs including those derived from adjectives and nouns. Corpus-based Colloquial Verbal Forms are registered as well in the current version.  

 

The category AS also consists of Simple Adjectives as well as Complex Adjectives including those derived from nouns. Corpus-based Colloquial Adjectival Forms are also registered in the current version.  

 

The category DS includes not only Simple Adverbs but also Exclamations, Onomatopoeias and Complex Adverbs derived from adjectives. Corpus-based Colloquial Adverbial Forms are also registered in the current version.

 

The following table shows the organization of DECO-Tagset for the Lexical Elements:  

 

 

 

The DECO-Tagsets for the Inflectional Suffixes, which are constructed in the form of Finite-State Transducers (FSTs), are as follows: Suffixes (N-postpositions, i.e. JOSA) for Noun (JN), Suffixes (D-postpositions, i.e. JOSA) for Adverb (JD), Suffixes (A-postpositions, i.e. EOMI) for Adjective (EA), and Suffixes (V-postpositions, i.e. EOMI) for Verb (EV).

    

  

 

INFLECTIONAL CLASSES

 



As Korean is an agglutinative language, inflectional morphology is one of its most difficult and complex research areas. Building an electronic dictionary of Korean that can recognize and analyze all correct inflected forms is an extremely challenging task.

 

The total number of inflected forms of all grammatical categories recognized by the DECO inflectional suffix transducers is estimated to be n × 10^8, which is far too large to be provided efficiently in list form.

 

For this reason, the DECO dictionaries of canonical forms are transformed into finite-state transducers and then linked to the inflectional suffix transducers, instead of being compiled into full-form dictionaries as is done for the inflectional dictionaries of European languages such as French or Italian.


Each lexical entry (i.e. canonical form) is assigned a POS (Part of Speech) tag and an inflectional class number that links it to the appropriate set of inflectional suffixes. For Nouns and Adverbs, 3 inflectional classes are defined according to the type of final consonant (no consonant, consonant other than 'ㄹ', consonant 'ㄹ'). For Verbs and Adjectives, 28 and 26 inflectional classes are defined respectively.
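The three final-consonant types used for Nouns and Adverbs can be detected from the Unicode code point of the last syllable. The following Python sketch relies on the standard arithmetic layout of precomposed Hangul syllables (28 final-consonant slots per initial-vowel pair). The function name and the returned labels are assumptions for illustration, and the sketch does not claim to reproduce DECO's actual class numbers. Treating final clusters that begin with ㄹ (such as ㄺ) as "consonant other than ㄹ" is also an assumption.

```python
HANGUL_BASE = 0xAC00   # first precomposed Hangul syllable (가)
HANGUL_LAST = 0xD7A3   # last precomposed Hangul syllable (힣)
FINAL_SLOTS = 28       # final-consonant slots per syllable block (0 = none)
RIEUL_INDEX = 8        # slot of the single final consonant ㄹ


def final_consonant_type(word):
    """Classify a word by the final consonant of its last syllable."""
    code = ord(word[-1])
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("last character is not a precomposed Hangul syllable")
    final = (code - HANGUL_BASE) % FINAL_SLOTS
    if final == 0:
        return "no consonant"
    if final == RIEUL_INDEX:
        return "consonant ㄹ"
    # Assumption: clusters such as ㄺ or ㄻ are grouped with the other consonants.
    return "consonant other than ㄹ"
```

For example, `final_consonant_type("학교")` returns `"no consonant"`, `final_consonant_type("사람")` returns `"consonant other than ㄹ"`, and `final_consonant_type("연필")` returns `"consonant ㄹ"`.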

 

  

MORPHO-SYNTACTIC-SEMANTIC INFORMATION

 

Besides POS information, morphological information is also assigned to the lexical entries: for instance, ‘PHA’ (predicative noun forming an adjective with hada), ‘MES’ (classifier dependent noun), ‘FIR’ (first person pronoun), ‘NUM’ (basic numeral), ‘DOP’ (verb with the suffix doida), ‘CPA’ (copula), and so on. The total number of morphological information tags is around 120 for the 4 basic categories.

 

Syntactic information is attributed to the adjective and verb entries. For instance, among adjectives, ‘YAWS’ (class of symmetric adjectives) is formally defined by the syntactic criterion “N0 N1-wa (=with) ADJ = N1 N0-wa (=with) ADJ”, e.g. pyenghanghada (=to be parallel). This syntactic equivalence means that the two arguments (N0 and N1) can be permuted without any change in the logical relation between them.

 

Concerning verbs, for example, ‘YVWZ’ (class of symmetric verbs) is defined with “N0 N1-wa V”. The permutation of N0 and N1 is allowed in ‘John quarreled with Susan (= Susan quarreled with John)’, whereas it is not in ‘John relied on Susan (≠ Susan relied on John)’.

 

The total number of syntactic classes is 26 for both categories. The classification of Korean adjectives and verbs is based on the Lexicon-Grammar framework (Maurice Gross 1975, Jeesun Nam 1996, 2007, Soyun Kim & Jeesun Nam 2010).

 

Semantic information is attributed to all categories, i.e. Nouns, Adjectives, Verbs and Adverbs: in the NS category, the simple common nouns (ZNZ) are classified into 56 semantic classes; the verbal entries are classified into 30 classes; the adjectival entries are classified into 12 classes; and in the DS category, 9,000 adverbs (i.e. excluding the 5,500 onomatopoeias) are classified into 6 semantic classes.

 

 

 

ILLUSTRATION

  

Let us consider the first entries of the DECO-NS dictionary:

 


 


The dictionary {DECO-NS.dic} is automatically transformed into the following format, called {DECO-NSflx.txt}, in which all syllables of the roots are decomposed into consonants and vowels and, where necessary, expanded into several variants (e.g. for irregular verbs). The resulting file is ready to be associated with a set of inflectional suffixes by means of the inflectional class numbers.
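The decomposition of syllables into consonants and vowels can be illustrated with Python's standard `unicodedata` module, whose NFD normalization maps each precomposed Hangul syllable to its sequence of conjoining letters. This is only a sketch of the decomposition step: the function name `decompose` is an assumption, the actual layout of {DECO-NSflx.txt} is not reproduced here, and the generation of variants for irregular forms is not covered.

```python
import unicodedata


def decompose(word):
    """Split each Hangul syllable into its consonant and vowel letters."""
    # NFD maps a precomposed syllable to its conjoining jamo sequence.
    return list(unicodedata.normalize("NFD", word))


def recompose(letters):
    """Inverse operation: rebuild precomposed syllables from the letters."""
    return unicodedata.normalize("NFC", "".join(letters))


jamo = decompose("학교")   # two syllables: 학 (3 letters) and 교 (2 letters)
```

Because NFC reverses NFD for Hangul, `recompose(jamo)` gives back the original word, so the decomposed form loses no information.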

 

 

 

The following graph is the main graph that calls all inflectional classes of suffixes for Adjectives ({EA.grf}); it is connected to the dictionary of adjectives, i.e. {DECO-ASflx.txt}.

 

 

  

  

THE KOREAN ELECTRONIC DICTIONARY SYSTEM DECO

 

The Korean electronic dictionary system DECO (Nam 2010, 2015, 2017) consists of a lemma dictionary stored in tabular format and a dictionary of inflectional suffix classes stored as Finite-State Transducers (FSTs).

 

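The former is divided into four major categories: nouns (NS), verbs (VS), adjectives (AS) and adverbs (DS). Each major category is further divided into several subtypes. Each lemma entry is assigned this major-category information, an inflectional class number, and various other tags carrying morphological, syntactic, semantic and sentiment-related information.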

 

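The latter is described as Finite-State Transducers (FSTs) that take into account the combinatorial relations among inflectional suffixes, which are highly developed and complex in Korean. Once these relations are represented as Directed Acyclic Graphs (DAGs) with FSTEditor on the UNITEX platform, they are automatically converted into FSTs and serve as electronic dictionaries for actual corpus processing.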