Institutional Repository of Peking University: Clustering based Two-Stage Text Classification Requiring Minimal Training Data

Title	Clustering based Two-Stage Text Classification Requiring Minimal Training Data
Authors	Zhang, Xue; Xiao, Wangxin
Affiliation	Peking Univ, Key Lab High Confidence Software Technol, Minist Educ, Beijing 100871, Peoples R China. Peking Univ, Sch Elect Engn & Comp Sci, Beijing 100871, Peoples R China. Shangqiu Normal Univ, Dept Phys, Shangqiu 476000, Peoples R China. Jinggangshan Univ, Dept Comp Sci, Jian 343009, Jiangxi, Peoples R China. Changsha Univ Sci & Technol, Sch Traff & Transportat Engn, Changsha 410114, Hunan, Peoples R China.
Keywords	text classification; clustering; active semi-supervised clustering; two-stage classification
Issue Date	2012
Publisher	Computer Science and Information Systems
Citation	Computer Science and Information Systems. 2012, 9(4, SI), 1627-1643.
Abstract	Clustering has been used to expand training data in some semi-supervised learning methods. Clustering-based methods assume that clusters learned under the guidance of the initial training data can, to some extent, characterize the underlying distribution of the data set. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training data set. It is often violated on data sets with poor separability, especially when the initial training data are too few; in that case, clustering-based methods perform worse. In this paper, we propose a clustering-based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are clustered under the guidance of the labeled data, and a self-training-style clustering strategy then iteratively expands the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers are trained on the expanded labeled data set. Unlike other clustering-based methods, the proposed clustering strategy can cope effectively with data of poor separability. Furthermore, the proposed framework converts the challenging problem of sparsely labeled text classification into a supervised one. Supervised classification models, e.g., SVM, can therefore be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods, and data editing or noise filtering, can be used to further improve classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small.
URI	http://hdl.handle.net/20.500.11897/291798
ISSN	1820-0214
DOI	10.2298/CSIS120130044Z
Indexed	SCI(E)
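Appears in Collections:	Key Laboratory of High Confidence Software Technologies (Ministry of Education); School of Electronics Engineering and Computer Science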
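
The two-stage procedure outlined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation; the record does not give the algorithmic details, so several choices here are assumptions: the clustering step is a seeded k-means with one cluster per class, the oracle is queried for the unlabeled point nearest each centroid, and a simple perceptron stands in for the SVM mentioned in the abstract. All function names (`seeded_clusters`, `expand_labels`, `train_perceptron`, `predict`) and the zero-centered toy data are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def seeded_clusters(X, labeled, iters=10):
    """Seeded k-means (assumed variant): one cluster per class, centroids
    initialised from the labeled points; labeled points keep their class."""
    classes = sorted(set(labeled.values()))
    cents = {c: centroid([X[i] for i, y in labeled.items() if y == c])
             for c in classes}
    assign = {}
    for _ in range(iters):
        assign = {
            i: labeled.get(i, min(classes, key=lambda c: math.dist(x, cents[c])))
            for i, x in enumerate(X)
        }
        cents = {c: centroid([X[i] for i in assign if assign[i] == c])
                 for c in classes}
    return assign, cents


def expand_labels(X, labeled, oracle, rounds=2):
    """Stage 1: re-cluster, then ask the oracle to label the unlabeled point
    closest to each centroid; repeat for a fixed number of rounds."""
    labeled = dict(labeled)
    for _ in range(rounds):
        assign, cents = seeded_clusters(X, labeled)
        for c, cent in cents.items():
            pool = [i for i in assign if assign[i] == c and i not in labeled]
            if pool:
                pick = min(pool, key=lambda i: math.dist(X[i], cent))
                labeled[pick] = oracle(pick)  # oracle may overrule the cluster
    return labeled


def train_perceptron(X, labeled, positive, epochs=100):
    """Stage 2: fit a binary linear classifier on the expanded labeled set.
    A perceptron is used here as a stand-in for an SVM."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for i in sorted(labeled):
            t = 1 if labeled[i] == positive else -1
            score = sum(wd * xd for wd, xd in zip(w, X[i])) + b
            if t * score <= 0:  # misclassified (or on the boundary): update
                w = [wd + t * xd for wd, xd in zip(w, X[i])]
                b += t
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def predict(w, b, x, positive=1, negative=0):
    """Return the positive class when the linear score is above zero."""
    score = sum(wd * xd for wd, xd in zip(w, x)) + b
    return positive if score > 0 else negative


# Toy, zero-centered 2-D "documents"; y_true is visible only to the oracle.
X = [(-3, -3), (-3, -2), (-2, -3), (-2, -2), (2, 2), (2, 3), (3, 2), (3, 3)]
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
seed_labels = {0: 0, 7: 1}  # one labeled example per class

expanded = expand_labels(X, seed_labels, oracle=lambda i: y_true[i], rounds=2)
w, b = train_perceptron(X, expanded, positive=1)
predictions = [predict(w, b, x) for x in X]
```

In this sketch the labeled set grows from two to six examples before the supervised stage. On real text data the vectors would be document features such as TF-IDF weights, and a max-margin classifier would normally replace the perceptron, whose decision boundary depends on how the features are centered and ordered.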

License: See PKU IR operational policies.