Developing interpretable, accurate, and robust regression and clustering methods

Principal Investigator: Prof. Nelson Baloian

University: Department of Computer Sciences, University of Chile

Research team: Ashot Harutyunyan, Arnak Poghosyan, Karen Petrosyan, Edgar Davtyan, Aram Adamyan, Alexander Aramyan

Contributing researchers: Maral Chahverdian, Aneta Baloyan

Duration: 2023-2026

Participating and hosting partner: American University of Armenia

Project Importance

Both high performance and interpretability are desirable features of a data science instrument, although experience shows that the search for a high-performing method makes interpretability difficult to achieve, and vice versa. Interpretability is an essential characteristic when users want to learn more about the phenomenon that generates the data being studied. It also allows humans to trust and work efficiently with machine learning.

On the one hand, machine learning methods with high accuracy, such as Gradient Boosting, Random Forest, and those based on Neural Networks such as Deep Learning, are considered black-box algorithms due to their lack of inherent interpretability.

On the other hand, Linear Regression and Decision Tree methods are recognized for their high and inherent interpretability. However, linear models are limited because they cannot learn non-linear relationships between attributes, and Decision Trees must have a reduced depth to remain interpretable, which makes them less accurate. The development of general methods that consistently obtain good results on both aspects has already attracted the interest of some authors; however, most existing works suffer from trade-offs between accuracy and interpretability.

Another important desirable characteristic for machine learning methods is robustness, understood here as the ability of the model to deal with data with missing values. In the real world, many data sets contain records for which some values were not registered. Most machine learning methods cannot process such samples unless their missing values are filled with default values (such as mean or median values), which may affect accuracy on both the training and the target data. Another common approach has been to ignore samples with missing values, which means that the training data set must be reduced, affecting the model’s performance.
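The two common workarounds for missing values described above, discarding incomplete samples and filling gaps with a column statistic, can be sketched in a few lines. This is only an illustrative sketch and not a method proposed by the project: the toy records, the column meaning (age, income), and the helper names `drop_incomplete` and `impute` are assumptions made for the example.

```python
from statistics import mean, median

# Hypothetical toy records: each row is (age, income); None marks a missing value.
rows = [
    (25, 3000.0),
    (40, None),
    (None, 5200.0),
    (31, 4100.0),
]

def drop_incomplete(rows):
    """Keep only rows without missing values (shrinks the training set)."""
    return [r for r in rows if all(v is not None for v in r)]

def impute(rows, strategy=mean):
    """Fill each missing value with a statistic of the observed values in its column."""
    columns = list(zip(*rows))
    fills = [strategy([v for v in col if v is not None]) for col in columns]
    return [tuple(f if v is None else v for v, f in zip(r, fills)) for r in rows]

complete = drop_incomplete(rows)             # half of the samples are lost
filled_mean = impute(rows, strategy=mean)    # all samples kept, gaps replaced by column means
filled_median = impute(rows, strategy=median)
```

Deletion keeps only two of the four records here, while imputation keeps all four but inserts values that were never observed. These are the two costs the paragraph above points to.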

Importance scoring of features as core attributes of the learned models is another essential aspect of their global explainability. It indicates the overall degree of utility of a particular variable in the data set within the prediction mechanics of the method. Throughout the project, the team will also investigate this global interpretability methodology in view of the main elements of the proposal, such as robustness and the plausibility angle. This might lead to a novel feature importance scoring mechanism in both regression and clustering settings, which can then be validated on real-world data sets and analyzed against state-of-the-art ways of feature ranking.
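For context, one widely used existing way of ranking features globally is permutation importance: a feature is scored by how much the model's error grows when that feature's values are shuffled. The sketch below illustrates that existing baseline only and is not the scoring mechanism the project intends to develop. The toy data, the stand-in `model` function, and the parameters `n_repeats` and `seed` are assumptions made for the example.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions on (X, y)."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Average increase in MSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    scores = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the data set with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

# Hypothetical "fitted" model that depends on feature 0 only.
def model(row):
    return 2.0 * row[0]

X = [[float(i), float((i * 7) % 5)] for i in range(8)]
y = [2.0 * row[0] for row in X]

scores = permutation_importance(model, X, y)
```

Because the stand-in model ignores feature 1, shuffling it leaves the error unchanged and its score is zero, while shuffling feature 0 increases the error.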

Expected Results

The project will foster the development of machine learning models which, apart from demonstrating high performance, will also be robust in dealing with samples with missing values and will be inherently interpretable, meaning that the model produces explanations of the criteria used to produce an outcome without needing further processing or analysis. Besides this, the project will have the following outcomes:

Promote the development of capacities in Armenia that will make it possible not only to successfully apply machine learning to a variety of problems in various domains but also to contribute to the development of the discipline at a global level.
Formation of a team of specialized individuals with strong, holistic knowledge in all aspects of machine learning, from development of models to the application and evaluation of them in various practical domains.
Organization of a summer school on ML (during the first year) with the participation of colleagues from Germany and Japan.
Organization of the fourth CODASSCA workshop on Data Science in 2024, making it more representative.
Publish at least 3 journal and 3 conference papers targeting top-quality, internationally acclaimed journals and conferences.
Establish cooperation activities with companies (VMware), public institutions (Ministry of Healthcare), NGO partners (Acopian Center for the Environment), and universities (AI Lab at YSU), as well as other research groups working on data science in Armenia, to create synergies.