A RANDOM FOREST-BASED METHOD FOR DETECTING FEATURE DEPENDENCIES: A COMPARISON WITH PEARSON CORRELATION, MUTUAL INFORMATION, AND DISTANCE CORRELATION
DOI: https://doi.org/10.32782/tnv-tech.2025.1.8

Keywords: Machine Learning, Feature Dependencies, Random Forest, Unsupervised Learning

Abstract
In this work, we introduce a novel method for identifying dependent features in datasets that lack a target variable, a critical challenge in unsupervised learning. Understanding feature dependencies is essential in many machine learning applications, including dimensionality reduction, feature selection, and data preprocessing, where capturing both linear and nonlinear relationships among features is necessary. Traditional methods for dependency detection, such as Pearson correlation, mutual information, and distance correlation, are widely used but often exhibit limitations, particularly when dealing with complex, high-dimensional data or nonlinear dependencies.

Our approach addresses these challenges through a synthetic dataset generation technique. Specifically, we create synthetic features by sampling from the empirical distributions of the original features. This ensures that the synthetic features are statistically independent of one another while preserving each feature's marginal distribution. We then label the original dataset instances as 1 and the synthetic ones as 0, forming a binary classification problem. A Random Forest classifier is trained to distinguish between these two classes, and the feature importance scores obtained from the trained model indicate which features exhibit dependency: features that contribute significantly to the classification task are identified as dependent, while those with lower importance scores are considered independent.

To evaluate the effectiveness of our method, we compare it against well-established dependency detection techniques. Pearson correlation captures only linear dependencies, while mutual information and distance correlation can account for more complex relationships. Our experimental results demonstrate that the proposed approach outperforms these traditional methods, consistently identifying the correct set of dependent features across the tested scenarios. Moreover, our method exhibits greater robustness to noise, making it a reliable tool for unsupervised feature dependency detection in real-world datasets.
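The procedure described above can be summarized in a few lines. The following is a minimal sketch, assuming NumPy and scikit-learn; the function name dependency_scores and the parameter defaults are illustrative rather than taken from the paper, and column-wise permutation is used here as one way to sample each feature's empirical distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def dependency_scores(X, n_estimators=200, random_state=0):
    """Score each feature by how much it helps a Random Forest
    separate real rows from synthetic, dependence-free rows.
    (Illustrative sketch; names and defaults are not from the paper.)"""
    rng = np.random.default_rng(random_state)
    n, d = X.shape

    # Synthetic data: permute each column independently. Each synthetic
    # feature is drawn from the corresponding empirical marginal, but all
    # inter-feature dependencies are broken.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(d)])

    # Binary classification problem: original instances = 1, synthetic = 0.
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(n), np.zeros(n)])

    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 random_state=random_state)
    clf.fit(X_all, y_all)

    # Features with high importance carry dependencies the forest exploits
    # to tell the two classes apart; features it ignores are treated as
    # (approximately) independent of the rest.
    return clf.feature_importances_
```

On a toy dataset where, say, the second column is a noisy quadratic function of the first and a third column is pure noise, the first two columns should receive markedly higher scores than the third.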
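The three baseline measures behave quite differently on nonlinear relationships. Below is a hedged illustration, assuming SciPy, scikit-learn, and the dcor package (Ramos-Carreño & Torrecilla, 2023) are installed; the quadratic toy example is ours, not from the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression
import dcor

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = x ** 2 + 0.1 * rng.normal(size=1000)  # nonlinear, but linearly uncorrelated

r, _ = pearsonr(x, y)                                 # captures linear dependence only
mi = mutual_info_regression(x.reshape(-1, 1), y,
                            random_state=42)[0]       # kNN estimator (Kraskov et al., 2004)
dc = dcor.distance_correlation(x, y)                  # zero iff x and y are independent

print(f"Pearson r = {r:.3f}, MI = {mi:.3f}, dCor = {dc:.3f}")
```

Because x and x² are uncorrelated when x is symmetric about zero, Pearson's r comes out near zero here, while the mutual information and distance correlation estimates are clearly positive; this is precisely the limitation of linear methods that the abstract refers to.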
References
Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781315139470
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Louppe, G. (2015). Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502. https://arxiv.org/abs/1407.7502
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Touw, W. G., Bayjanov, J. R., Overmars, L., Backus, L., Boekhorst, J., Wels, M., & van Hijum, S. A. F. T. (2013). Data mining in the life sciences with random forest: A walk in the park or lost in the jungle? Briefings in Bioinformatics, 14(3), 315–326. https://doi.org/10.1093/bib/bbs034
Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66. https://www.stat.berkeley.edu/~rabbee/correlation.pdf
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Fernández del Río, J., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
Cover, T. M., & Thomas, J. A. (2005). Elements of information theory (pp. 13–55). John Wiley & Sons. https://www.cs.columbia.edu/~vh/courses/LexicalSemantics/Association/Cover&Thomas-Ch2.pdf
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138. https://doi.org/10.1103/PhysRevE.69.066138
Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769–2794. https://doi.org/10.1214/009053607000000505
Ramos-Carreño, C., & Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. https://doi.org/10.1016/j.softx.2023.101326