Emergence of Topological Shortcuts in Machine Learning
Common deep learning frameworks formulate link prediction as a binary classification task, predicting links between nodes from their features. Successfully training a binary classifier requires positive samples, i.e., node pairs that are known to bind to each other, as well as negative samples, i.e., pairs that do not interact. Such positive and negative records are usually determined by thresholding continuous variables that characterize the strength of the interaction between two nodes. However, we often observe that link strength is not randomly distributed across the records: the number of annotations per node correlates with the node's average interaction strength. Because the annotations commonly follow fat-tailed distributions, this correlation gives hub nodes disproportionately more positive links on average, whereas nodes with fewer annotations have more negative examples. Uniformly sampled training datasets affected by this annotation imbalance prompt ML models to learn and predict that some nodes are connected disproportionately more often than others. In other words, ML models learn the connection patterns from the degree of the nodes, neglecting relevant node features. Exploiting this annotation imbalance yields good performance on the missing links among nodes already present in the training network, a phenomenon we term the emergence of topological shortcuts. A key consequence, and a signal, of such topological shortcuts is that the performance of an ML model degrades when it is asked to perform link prediction between novel (i.e., never-before-seen) nodes.
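To make the shortcut concrete, here is a minimal Python sketch of a feature-free baseline that scores a pair using only how often each node appears with a positive label in the training records. It is an illustration under stated assumptions, not the method presented in the talk: the function names `fit_degree_ratios` and `predict`, the toy records, the averaging of the two per-node ratios, and the fallback to the global positive rate for unseen nodes are all hypothetical choices.

```python
from collections import defaultdict


def fit_degree_ratios(records):
    """Compute each node's fraction of positive records (hypothetical helper).

    records: list of (u, v, label) tuples, with label 1 (positive) or 0 (negative).
    Returns (per-node positive ratio, global positive rate).
    """
    records = list(records)
    positives = defaultdict(int)
    totals = defaultdict(int)
    for u, v, label in records:
        for node in (u, v):
            totals[node] += 1
            positives[node] += label
    ratios = {node: positives[node] / totals[node] for node in totals}
    global_rate = sum(label for _, _, label in records) / len(records)
    return ratios, global_rate


def predict(u, v, ratios, global_rate):
    """Score a pair from annotation counts alone; node features are never used.

    Assumption: unseen nodes fall back to the global positive rate.
    """
    ru = ratios.get(u, global_rate)
    rv = ratios.get(v, global_rate)
    return (ru + rv) / 2


# Toy records: a well-annotated "hub" with positive links and a rarely
# annotated node with a single negative record.
train = [
    ("hub", "t1", 1),
    ("hub", "t2", 1),
    ("hub", "t3", 1),
    ("rare", "t1", 0),
]
ratios, global_rate = fit_degree_ratios(train)

seen_hub = predict("hub", "t2", ratios, global_rate)    # both nodes seen in training
seen_rare = predict("rare", "t2", ratios, global_rate)  # both nodes seen in training
novel_a = predict("new1", "new2", ratios, global_rate)  # neither node seen
novel_b = predict("new3", "new4", ratios, global_rate)  # neither node seen
```

For nodes seen in training, the score separates the hub from the rarely annotated node, although no feature was consulted. For novel nodes, every pair receives the same score, so the baseline cannot rank them. This mirrors the degradation on never-before-seen nodes described above.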
In this talk, I will cover some of the strategies we have developed to control for over-fitting and annotation imbalance and to maximize generalization to unseen nodes.
Bio of the speaker: Dr. Giulia Menichetti is a Senior Research Scientist at the Network Science Institute (Northeastern University), and an Associate Researcher at Brigham and Women’s Hospital (Harvard Medical School). She is a statistical/computational physicist by training, and during her Ph.D. she specialized in Network Science. She currently leads the Foodome project, which aims to track the full chemical complexity of the food we consume and develop quantitative tools to unveil, at the mechanistic level, the impact of these chemicals on our health.