JCP 2018 Vol.13(8): 905-912 ISSN: 1796-203X
doi: 10.17706/jcp.13.8.905-912
Recursive Clustering Using Different Features Sets for Metagenomic Data
Isis Bonet1, Widerman Montoya1, Andrea Mesa-Munera1, Juan Fernando Alzate2
1EIA University, km 2 + 200 Vía al Aeropuerto José María Córdova, Envigado, Antioquia, Colombia.
2Centro Nacional de Secuenciación Genómica-CNSG, Facultad de Medicina, Universidad de Antioquia, Calle 67 Número 53-108, Medellín, Antioquia, Colombia.
Abstract—The metagenomic binning process is a step prior to the taxonomic assignment of metagenomic reads or contigs; it groups genome sequences that belong to the same species. In this paper we propose a clustering method that is executed recursively to cluster contigs into groups of the same taxa. At each step the method refines the taxonomic level, beginning with the domain and ending with groups that represent species. The method relies on a prior rule-based system to separate viruses from the rest of the organisms, and on feature selection algorithms to select a different feature set at each clustering step. The clustering is based on k-means++ with Cosine and Jaccard distances, and the feature selection is based on information gain. The proposed method outperforms classic k-means++, achieving a cluster purity of 88.15%.
Index Terms—Binning process, clustering, feature selection, metagenomics.
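The recursive scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes dinucleotide frequencies as features, uses only the Cosine distance, takes the arithmetic mean as the cluster centre (a common heuristic), and omits the rule-based virus filter and the per-level feature selection. All function names and parameters (`kmer_profile`, `recursive_cluster`, `depth`, and so on) are hypothetical.

```python
import math
import random
from collections import Counter
from itertools import product

def kmer_profile(seq, k=2):
    """Normalized k-mer frequency vector over the alphabet ACGT (assumed features)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1
    return [counts[m] / total for m in kmers]

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 1.0
    return max(0.0, 1.0 - dot / (nu * nv))

def kmeans_pp(points, k, rng, iters=50):
    """k-means++ seeding, then Lloyd-style iterations under Cosine distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(cosine_distance(p, c) for c in centers) ** 2 for p in points]
        if sum(d2) == 0:
            break  # every point coincides with an existing centre
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    labels = [0] * len(points)
    for step in range(iters):
        new_labels = [
            min(range(len(centers)), key=lambda j: cosine_distance(p, centers[j]))
            for p in points
        ]
        if step > 0 and new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def recursive_cluster(points, ids, k, depth, rng):
    """Split into up to k groups, then re-cluster each group, `depth` levels deep."""
    if depth == 0 or len(points) <= k:
        return [ids]
    labels = kmeans_pp(points, k, rng)
    if len(set(labels)) < 2:
        return [ids]  # no split was possible at this level
    groups = []
    for j in sorted(set(labels)):
        sub = [i for i, lab in enumerate(labels) if lab == j]
        groups += recursive_cluster(
            [points[i] for i in sub], [ids[i] for i in sub], k, depth - 1, rng
        )
    return groups

def purity(clusters, truth):
    """Fraction of items that carry the majority label of their cluster."""
    total = sum(len(c) for c in clusters)
    hits = sum(Counter(truth[i] for i in c).most_common(1)[0][1] for c in clusters)
    return hits / total
```

In this sketch, `depth` stands in for the number of taxonomic levels traversed; a version closer to the abstract would re-select the feature set before each call to `kmeans_pp`.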
Cite: Isis Bonet, Widerman Montoya, Andrea Mesa-Munera, Juan Fernando Alzate, "Recursive Clustering Using Different Features Sets for Metagenomic Data," Journal of Computers, vol. 13, no. 8, pp. 905-912, 2018.
General Information
ISSN: 1796-203X
Abbreviated Title: J.Comput.
Frequency: Bimonthly
Editor-in-Chief: Prof. Liansheng Tan
Executive Editor: Ms. Nina Lee
Abstracting/Indexing: DBLP, EBSCO, ProQuest, INSPEC, ULRICH's Periodicals Directory, WorldCat, etc.
E-mail: jcp@iap.org