JCP 2009 Vol.4(3): 230-237 ISSN: 1796-203X
doi: 10.4304/jcp.4.3.230-237
An Improved KNN Text Classification Algorithm Based on Clustering
Zhou Yong, Li Youwen, Xia Shixiong
School of Computer Science & Technology, China University of Mining & Technology, Xuzhou, Jiangsu 221116, China
Abstract—The traditional KNN text classification algorithm uses all training samples for classification, so it must process a large number of training samples at high computational cost, and it does not reflect the differing importance of individual samples. To address these problems, this paper proposes an improved KNN text classification algorithm based on cluster centers. First, the given training sets are compressed by deleting the samples near the border, which eliminates the multipeak effect of the training sample sets. Second, the training samples of each category are clustered with the k-means algorithm, and all cluster centers are taken as the new training samples. Third, a weight is introduced for each new training sample (each cluster center), indicating its importance according to the number of samples in the cluster it represents. Finally, the modified samples are used to perform KNN text classification. The simulation results show that the proposed algorithm not only effectively reduces the number of training samples and lowers the computational complexity, but also improves the accuracy of KNN text classification.
Index Terms—Text classification, KNN algorithm, sample austerity, cluster.
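The clustering, weighting, and classification steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how border samples are deleted, so that step is omitted here, and the function names (`kmeans`, `build_model`, `classify`), the use of cosine similarity, and the weight formula (cluster size divided by category size) are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def kmeans(points, m, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a list of (center, cluster_size)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, min(m, len(points)))]
    groups = []
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean).
            j = min(range(len(centers)),
                    key=lambda c: sum((x - y) ** 2
                                      for x, y in zip(p, centers[c])))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old center if a cluster went empty
                centers[j] = [sum(col) / len(g) for col in zip(*g)]
    return [(c, len(g)) for c, g in zip(centers, groups) if g]


def build_model(samples, labels, m):
    """Cluster each category separately; the centers become the training set."""
    by_label = defaultdict(list)
    for x, y in zip(samples, labels):
        by_label[y].append(x)
    model = []  # entries: (center, weight, label)
    for y, pts in by_label.items():
        for center, size in kmeans(pts, m):
            # Assumed weight: fraction of the category's samples in this cluster.
            model.append((center, size / len(pts), y))
    return model


def classify(model, x, k):
    """Weighted KNN vote over the k cluster centers most similar to x."""
    nearest = sorted(model, key=lambda e: cosine(x, e[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for center, weight, label in nearest:
        votes[label] += weight * cosine(x, center)
    return max(votes, key=votes.get)


# Toy two-dimensional "term weight" vectors; real inputs would be TF-IDF vectors.
samples = [[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
           [0.1, 1.0], [0.2, 0.9], [0.1, 0.8]]
labels = ["sports"] * 3 + ["finance"] * 3
model = build_model(samples, labels, m=2)
print(classify(model, [0.95, 0.15], k=3))
```

Because classification compares a query against at most `m` centers per category instead of every training document, the cost per query in this sketch depends on the number of clusters rather than on the size of the original training set.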
Cite: Zhou Yong, Li Youwen, Xia Shixiong, "An Improved KNN Text Classification Algorithm Based on Clustering," Journal of Computers, vol. 4, no. 3, pp. 230-237, 2009.
General Information
ISSN: 1796-203X
Abbreviated Title: J.Comput.
Frequency: Bimonthly
Editor-in-Chief: Prof. Liansheng Tan
Executive Editor: Ms. Nina Lee
Abstracting/Indexing: DBLP, EBSCO, ProQuest, INSPEC, ULRICH's Periodicals Directory, WorldCat, etc.