Title | The building of a comprehensive toponym corpus for Chinese information processing |
Authors | Liu, Qiang; Yu, Jingsong; Wu, Shenglan; Wang, Huilin |
Affiliation | Department of Language Information Engineering, Peking University, No. 5, Yiheyuan Road, Haidian District, Beijing 100871, China; Institute of Scientific and Technical Information of China, No. 15, Fuxing Road, Beijing 100038, China |
Issue Date | 2013 |
Publisher | ICIC Express Letters, Part B: Applications |
Citation | ICIC Express Letters, Part B: Applications. 2013, 4(5), 1409-1415. |
Abstract | This paper describes the process of creating a comprehensive, large-scale Chinese Toponym Corpus that includes the names (and aliases) of all administrative divisions, roads or streets, and buildings, as many as we could find in mainland China, and the geographical relationships among them. As raw data for corpus building, we use government standard files, a GPS points-of-interest database, and address information crawled and extracted from house and office rental web sites. N-gram counts, improved mutual information, and other parameters are computed, and a bootstrapping method is applied, to acquire statistical models for Chinese address chunk segmentation and attribute annotation, using a tag set we designed specifically for Chinese natural language processing. We performed structural analysis and in-depth statistical analysis of Chinese toponyms and geographical entities to obtain a categorized toponym dictionary. Finally, based on semantic analysis of the Chinese Toponym Corpus and the results of all previous work, a Chinese toponym ontology with probabilistic information was built using the Neo4j graph database system. © 2013 ISSN 2185-2766. |
URI | http://hdl.handle.net/20.500.11897/410426 |
ISSN | 2185-2766 |
Indexed | EI |
Appears in Collections: | Unclaimed (待认领) |