A RANDOM FOREST-BASED METHOD FOR DETECTING FEATURE DEPENDENCIES: A COMPARISON WITH PEARSON CORRELATION, MUTUAL INFORMATION, AND DISTANCE CORRELATION
DOI: https://doi.org/10.32782/tnv-tech.2025.1.8

Keywords: Machine Learning, Feature Dependencies, Random Forest, Unsupervised Learning

Abstract
In this work, we introduce a novel method for identifying dependent features in datasets that lack a target variable, a critical challenge in unsupervised learning. Understanding feature dependencies is essential in many machine learning applications, including dimensionality reduction, feature selection, and data preprocessing, where capturing both linear and nonlinear relationships among features is necessary. Traditional methods for dependency detection, such as Pearson correlation, mutual information, and distance correlation, are widely used but often exhibit limitations, particularly when dealing with complex, high-dimensional data or nonlinear dependencies.

Our approach addresses these challenges through a synthetic dataset generation technique. Specifically, we create synthetic features by sampling from the empirical distributions of the original features. This ensures that the synthetic features are statistically independent of one another while preserving each feature's marginal distribution. We then label the original dataset instances as 1 and the synthetic ones as 0, forming a binary classification problem. A Random Forest classifier is trained to distinguish between these two classes, and the feature importance scores obtained from the trained model indicate which features exhibit dependency: features that contribute significantly to the classification task are identified as dependent, while those with lower importance scores are considered independent.

To evaluate the effectiveness of our method, we compare it against well-established dependency detection techniques. Pearson correlation captures only linear dependencies, while mutual information and distance correlation can account for more complex relationships. Our experimental results demonstrate that the proposed approach outperforms these traditional methods, consistently identifying the correct set of dependent features across the tested scenarios. Moreover, our method exhibits greater robustness to noise, making it a reliable tool for unsupervised feature dependency detection in real-world datasets.
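The procedure described above can be summarized in a few lines. The following is a minimal sketch, assuming NumPy and scikit-learn; the function name dependency_scores and the parameter defaults are illustrative rather than taken from the paper, and column-wise permutation is used here as one way to sample each feature's empirical distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def dependency_scores(X, n_estimators=200, random_state=0):
    """Score each feature by how much it helps a Random Forest
    separate real rows from synthetic, dependence-free rows.
    (Illustrative sketch; names and defaults are not from the paper.)"""
    rng = np.random.default_rng(random_state)
    n, d = X.shape

    # Synthetic data: permute each column independently. Each synthetic
    # feature is drawn from the corresponding empirical marginal, but all
    # inter-feature dependencies are broken.
    X_synth = np.column_stack([rng.permutation(X[:, j]) for j in range(d)])

    # Binary classification problem: original instances = 1, synthetic = 0.
    X_all = np.vstack([X, X_synth])
    y_all = np.concatenate([np.ones(n), np.zeros(n)])

    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 random_state=random_state)
    clf.fit(X_all, y_all)

    # Features with high importance carry dependencies the forest exploits
    # to tell the two classes apart; features it ignores are treated as
    # (approximately) independent of the rest.
    return clf.feature_importances_
```

On a toy dataset where, say, the second column is a noisy quadratic function of the first and a third column is pure noise, the first two columns should receive markedly higher scores than the third.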
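The three baseline measures behave quite differently on nonlinear relationships. Below is a hedged illustration, assuming SciPy, scikit-learn, and the dcor package (Ramos-Carreño & Torrecilla, 2023) are installed; the quadratic toy example is ours, not from the paper.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression
import dcor

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = x ** 2 + 0.1 * rng.normal(size=1000)  # nonlinear, but linearly uncorrelated

r, _ = pearsonr(x, y)                                 # captures linear dependence only
mi = mutual_info_regression(x.reshape(-1, 1), y,
                            random_state=42)[0]       # kNN estimator (Kraskov et al., 2004)
dc = dcor.distance_correlation(x, y)                  # zero iff x and y are independent

print(f"Pearson r = {r:.3f}, MI = {mi:.3f}, dCor = {dc:.3f}")
```

Because x and x² are uncorrelated when x is symmetric about zero, Pearson's r comes out near zero here, while the mutual information and distance correlation estimates are clearly positive; this is precisely the limitation of linear methods that the abstract refers to.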
References
Breiman, L., Friedman, J., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees (1st ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781315139470
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324
Louppe, G. (2015). Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502. https://arxiv.org/abs/1407.7502
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Touw, W. G., Bayjanov, J. R., Overmars, L., Backus, L., Boekhorst, J., Wels, M., & van Hijum, S. A. F. T. (2013). Data mining in the life sciences with random forest: A walk in the park or lost in the jungle? Briefings in Bioinformatics, 14(3), 315–326. https://doi.org/10.1093/bib/bbs034
Rodgers, J. L., & Nicewander, W. A. (1988). Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), 59–66. https://www.stat.berkeley.edu/~rabbee/correlation.pdf
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Fernández del Río, J., Wiebe, M., Peterson, P., … Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2
Cover, T. M., & Thomas, J. A. (2005). Elements of information theory (pp. 13–55). John Wiley & Sons. https://www.cs.columbia.edu/~vh/courses/LexicalSemantics/Association/Cover&Thomas-Ch2.pdf
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69, 066138. https://doi.org/10.1103/PhysRevE.69.066138
Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6), 2769–2794. https://doi.org/10.1214/009053607000000505
Ramos-Carreño, C., & Torrecilla, J. L. (2023). dcor: Distance correlation and energy statistics in Python. SoftwareX, 22, 101326. https://doi.org/10.1016/j.softx.2023.101326