Transformational machine learning: Learning how to learn from many related scientific problems

Olier, Ivan; Orhobor, Oghenejokpeme I.; Dash, Tirtharaj; Davis, Andy M.; Soldatova, Larisa N.; Vanschoren, Joaquin and King, Ross D. 2021. Transformational machine learning: Learning how to learn from many related scientific problems. Proceedings of the National Academy of Sciences, 118(49), e2108013118. ISSN 0027-8424 [Article]

e2108013118.full.pdf - Published Version
Available under License Creative Commons Attribution.


Abstract or Description

Almost all machine learning (ML) is based on representing examples using intrinsic features. When there are multiple related ML problems (tasks), it is possible to transform these features into extrinsic features by first training ML models on other tasks and letting them each make predictions for each example of the new task, yielding a novel representation. We call this transformational ML (TML). TML is very closely related to, and synergistic with, transfer learning, multitask learning, and stacking. TML is applicable to improving any nonlinear ML method. We tested TML using the most important classes of nonlinear ML: random forests, gradient boosting machines, support vector machines, k-nearest neighbors, and neural networks. To ensure the generality and robustness of the evaluation, we utilized thousands of ML problems from three scientific domains: drug design, predicting gene expression, and ML algorithm selection. We found that TML significantly improved the predictive performance of all the ML methods in all the domains (4 to 50% average improvements) and that TML features generally outperformed intrinsic features. Use of TML also enhances scientific understanding through explainable ML. In drug design, we found that TML provided insight into drug target specificity, the relationships between drugs, and the relationships between target proteins. TML leads to an ecosystem-based approach to ML, where new tasks, examples, predictions, and so on synergistically interact to improve performance. To contribute to this ecosystem, all our data, code, and our ∼50,000 ML models have been fully annotated with metadata, linked, and openly published using Findability, Accessibility, Interoperability, and Reusability principles (∼100 Gbytes).
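The core transformation described in the abstract — replacing an example's intrinsic features with the predictions that related-task models make for it — can be sketched as follows. This is a minimal illustration, not the authors' published code: the synthetic tasks, model choice (random forests, one of the ML classes the paper evaluates), and all variable names are assumptions for the sketch.

```python
# Hypothetical sketch of the TML idea: models trained on related tasks each
# predict on every example of a new task; those predictions form the
# example's extrinsic (TML) feature vector.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for several related tasks sharing one intrinsic
# feature space (e.g. molecular descriptors in QSAR).
n_features = 20
related_tasks = []
for _ in range(5):
    X = rng.normal(size=(100, n_features))
    w = rng.normal(size=n_features)
    y = X @ w + rng.normal(scale=0.1, size=100)
    related_tasks.append((X, y))

# Step 1: train one model per related task on its intrinsic features.
task_models = [
    RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
    for X, y in related_tasks
]

# Step 2: for each example of the new task, collect the related-task
# models' predictions — this vector is the example's TML representation.
X_new = rng.normal(size=(30, n_features))
X_tml = np.column_stack([m.predict(X_new) for m in task_models])

print(X_tml.shape)  # one extrinsic feature per related-task model
```

A downstream model for the new task would then be trained on `X_tml` (or on intrinsic and TML features combined); the paper reports that TML features generally outperformed intrinsic ones across its three domains.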

Item Type:

Article

Identification Number (DOI):

Additional Information:

Copyright © 2021 the Author(s). Published by PNAS. This open access article is distributed under the Creative Commons Attribution License 4.0 (CC BY).

This article contains supporting information online at

To enable reproducibility, all of the thousands of datasets (QSAR, LINCS, and Metalearning), the links to the code (TML, RF, XGB, SVM, k-NN, NN), and the ∼50,000 ML RF models (counting all decision trees) are available under a Creative Commons license at the Open Science Framework: This amounts to ∼100 Gbytes of compressed data. Few ML projects have put this much reusable data online. To maximize its added value, we follow the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles for publishing digital objects (41) (see SI Appendix, FAIR Sharing).

Datasets, code, and ML models reported in this study have been deposited in Open Science Framework (

The work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, by the Engineering and Physical Sciences Research Council projects Robot Chemist and Action on Cancer, and by the Alan Turing Institute project Spatial Learning: Applications in Structure Based Drug Design.


Keywords:

AI, drug design, transfer learning, stacking, multitask learning

Departments, Centres and Research Units:



Accepted: 7 October 2021
Published Online: 29 November 2021
Published: December 2021

Item ID:


Date Deposited:

20 Dec 2021 16:14

Last Modified:

20 Dec 2021 16:14

Peer Reviewed:

Yes, this version has been peer-reviewed.


