Gene Selection for Breast Cancer Classification Using T-Test Filtering and Wrapper-Based Optimization
DOI:
https://doi.org/10.5281/zenodo.18112300
Keywords:
Breast cancer classification, Gene expression, Feature selection, T-Test filtering, Wrapper methods, Sequential Forward Selection, Genetic Algorithms, Hill Climbing, Microarray data, Bioinformatics
Abstract
Accurate classification of breast cancer using gene expression data is challenged by the high dimensionality and noise inherent in microarray datasets. This study proposes and evaluates a hybrid feature selection pipeline that integrates T-Test filtering with four wrapper-based optimization methods: Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), Genetic Algorithms (GA), and Hill Climbing.
The T-Test was first used to reduce the initial feature space, after which each wrapper method was used to identify an optimal gene subset based on classifier performance. Experimental results show that SFS achieved the highest classification accuracy (98%) and AUC (0.99), while Hill Climbing offered the fastest execution time (1.68 seconds) but with a trade-off in specificity. Notably, the genes SFRP1 and CNTNAP3 were recurrently selected by different methods, suggesting their potential as key biomarkers. To further clarify these findings, we conducted an in-depth analysis of gene overlaps using Jaccard indices and explored the biological roles of selected genes, revealing SFRP1's tumor-suppressive function via Wnt pathway inhibition and CNTNAP3's emerging prognostic value in various cancers.
These findings emphasize the effectiveness of hybrid selection strategies and provide insight into the trade-offs between classifier performance, computational cost, and gene stability in breast cancer classification. This extended analysis incorporates additional performance visualizations and detailed overlap metrics to offer a comprehensive view of the applicability of hybrid feature selection in bioinformatics.
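
The filter-then-wrapper idea summarized in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the Welch form of the t-statistic, the leave-one-out nearest-centroid scorer, the function names, and the toy expression matrix. The abstract does not state which classifier or t-test variant the authors used.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two samples (assumed variant; unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def t_test_filter(X, y, k):
    """Filter step: indices of the k features with the largest |t| between classes 1 and 0."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(welch_t(pos, neg)), j))
    scores.sort(key=lambda s: (-s[0], s[1]))  # highest |t| first, ties by index
    return [j for _, j in scores[:k]]


def loo_centroid_accuracy(X, y, subset):
    """Hypothetical wrapper score: leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for label in set(y):
            rows = [X[r] for r in range(len(X)) if r != i and y[r] == label]
            centroids[label] = [mean(row[j] for row in rows) for j in subset]
        point = [X[i][j] for j in subset]
        pred = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(X)


def sequential_forward_selection(candidates, score, max_features):
    """Wrapper step: greedily add the feature that most improves score(subset)."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        gains = [(score(selected + [c]), c) for c in candidates if c not in selected]
        if not gains:
            break
        top_score, top_feature = max(gains, key=lambda g: g[0])
        if top_score <= best:  # stop once no candidate improves the score
            break
        selected.append(top_feature)
        best = top_score
    return selected, best


# Toy data (made up): 6 samples x 4 features, labels 1 = tumor, 0 = normal.
X = [
    [5.0, 1.0, 2.1, 0.3],
    [5.2, 0.9, 2.0, 0.8],
    [4.9, 1.1, 2.3, 0.5],
    [1.0, 1.0, 0.9, 0.4],
    [1.2, 0.8, 1.1, 0.7],
    [0.9, 1.2, 1.0, 0.6],
]
y = [1, 1, 1, 0, 0, 0]

kept = t_test_filter(X, y, k=2)
selected, best = sequential_forward_selection(
    kept, lambda subset: loo_centroid_accuracy(X, y, subset), max_features=2
)
```

The filter cheaply discards features with little class separation, so the wrapper, which re-evaluates a classifier for every candidate subset, only searches a small pool. Sequential Backward Selection, Genetic Algorithms, and Hill Climbing would replace only the `sequential_forward_selection` step in this sketch.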

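The overlap analysis mentioned in the abstract relies on the Jaccard index, J(A, B) = |A ∩ B| / |A ∪ B|. The sketch below shows how pairwise overlaps between selected gene subsets might be computed. The method labels and the `GENE_*` entries are placeholders, and the subsets are invented for illustration; they are not the gene lists reported in the study.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections: size of intersection over size of union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Placeholder subsets (hypothetical): one gene set per selection method.
subsets = {
    "method_1": {"SFRP1", "CNTNAP3", "GENE_A"},
    "method_2": {"SFRP1", "CNTNAP3", "GENE_B", "GENE_C"},
    "method_3": {"GENE_A", "GENE_D"},
}

# Pairwise overlap for every unordered pair of methods.
overlaps = {
    (m1, m2): jaccard(subsets[m1], subsets[m2])
    for m1, m2 in combinations(sorted(subsets), 2)
}

for pair, value in overlaps.items():
    print(pair, round(value, 2))
```

A value near 1 indicates that two methods select nearly the same genes, while a value near 0 indicates little agreement, which is one way to quantify the gene stability discussed above.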
