Briefings in Bioinformatics

A review of Feature Selection on Knowledge Graphs

Sisi Shao\ORCID0009-0000-9783-9205    Pedro Henrique Ribeiro    Christina Ramirez\ORCID0000-0002-8435-0416    Jason H. Moore\ORCID0000-0002-5015-1099 \orgdivDepartment of Biostatistics, \orgnameFielding School of Public Health at University of California, Los Angeles, \orgaddress\street650 Charles E Young Dr S, \postcode90095-1772, \stateCalifornia, \countryCountry \orgdivDepartment of Computational Biomedicine, \orgnameCedars-Sinai Medical Center, \orgaddress\street8700 Beverly Blvd, \postcode90048, \stateCalifornia, \countryUnited States
(2024; 2024; Date; Date; Date)

A review of feature selection strategies utilizing graph data structures and knowledge graphs

Abstract

Feature selection in Knowledge Graphs (KGs) is increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper reviews the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability. Through this comprehensive review, we aim to catalyze further innovation in feature selection for KGs, paving the way for more insightful, efficient, and interpretable analytical models across various domains. Our exploration reveals the critical importance of scalability, accuracy, and interpretability in feature selection techniques, and we advocate for the integration of domain knowledge to refine the selection process. We highlight the potential of multi-objective optimization and interdisciplinary collaboration in advancing KG feature selection, underscoring the impact of such methodologies on precision medicine, among other fields. The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.

keywords:
Feature Selection, Knowledge Graphs, Deep Learning, Precision Medicine, Explainable AI.

1 Introduction

1.1 Brief Introduction to Knowledge Graphs

In the era of large-scale digital information, Knowledge Graphs (KGs) are an increasingly popular tool for organizing data and information (chicaiza2021comprehensive, ). At their core, KGs are characterized by representing entities and their relationships through triplets (subject-predicate-object), allowing for in-depth data analysis and the development of personalized care strategies. For instance, a triplet like “Cyclophosphamide - treats - Cancer” demonstrates KGs’ potential in drug discovery and repurposing. Platforms like Bio2RDF have been instrumental in exploring the complex relationships between genetics, diseases, and environmental factors. KGs thereby facilitate a comprehensive approach to healthcare; this approach supports a wide range of applications, from advanced decision-support systems to personalized medicine and innovative drug discovery methods (belleau2008bio2rdf, ; hasan2020knowledge, ).

One of the most well-known uses for KGs is in the development of web-based technologies, including search engines and the Semantic Web (an extension of the World Wide Web that enables data to be shared and reused across applications). Google KG, DBpedia, and Yet Another Great Ontology (YAGO) utilize the principles of the Semantic Web and Linked Open Data (LOD–a method of publishing structured data so that it can be interlinked and become more useful) to create extensive networks of nodes and edges, representing the intricate relationships within vast datasets and enabling enhanced query processing and analytics capabilities. The contributions of scholars such as Fensel et al. fensel2020introduction , Bonner et al. bonner2022understanding , and Yang et al. yang2023comprehensive have been crucial in shedding light on the foundational aspects and ongoing evolution of these systems.

As biology continues to advance, we are accumulating a vast amount of knowledge about genes, proteins, chemicals, cells, diseases, and other biological entities, along with their intricate and multifaceted interactions (levine2019biological, ). To make sense of this complexity, KGs have emerged as a powerful tool for organizing and connecting this information. In the realm of precision medicine, KGs have been used to consolidate disparate biomedical data and to improve the effectiveness of personalized patient care by systematically utilizing genetic, environmental, and lifestyle information. This application is exemplified by PrimeKG, which contributes to a comprehensive medical knowledge base by integrating a wide ontology with data from various sources, including genomic databases, thereby supporting detailed medical research and personalized care planning (chandak2023building, ).

The integration and analysis of data from biomedical research and clinical practice through KGs provide a dynamic platform for advancements in understanding and treating diseases. The academic discourse on feature selection methods applied to KGs, as highlighted by the studies referenced, underscores their transformative potential in various domains, particularly in advancing personalized medicine and healthcare outcomes.

1.2 Importance of Feature Selection

Feature selection involves choosing the subset of input variables most relevant for analysis. It is crucial in modern ML research because datasets now range from petabytes to exabytes. As datasets grow in size and complexity, identifying important attributes is essential to address the “curse of dimensionality” (bellman1966dynamic, ), which can degrade model performance. Reducing the feature set helps mitigate overfitting and improves computational efficiency (ferreira2012efficient, ). This reduction aids model interpretability in critical domains like healthcare and finance (lahmiri2016features, ; huda2016hybrid, ) and enhances the model’s generalizability to new data, a cornerstone for practical applications (forster2000key, ; saari2010generalizability, ). Streamlined models, requiring fewer computational resources, are beneficial in resource-constrained scenarios like edge computing (bikku2016hadoop, ; mohammed2020edge, ). With big data’s growing influence, especially in healthcare, where the big data market is projected to reach $79.23 billion by 2028, feature selection is increasingly vital to ensure robust and applicable models.

Often in ML, feature selection refers to selecting columns of a tabular dataset. In this paper, we take a broader view that includes selecting nodes or entities for hypothesis generation and further investigation. For example, a KG containing genes and diseases can be used to hypothesize new subsets of genes related to a specific disease.

Recognizing various feature selection methods, such as algorithmic techniques, statistical analyses (jovic2015review, ), and expert insights, we now explore the relationship between KGs and feature selection, highlighting how these frameworks can enhance the feature selection process.

Figure 1: An integrated overview of KGs encompassing RDF structuring, Ontological frameworks, and GDB management, illustrating the flow from data sources to semantic querying and storage. Figure 1 delineates the contribution of varied scholarly and scientific data sources—such as Google Scholar, PubMed, arXiv, and DrugBank—in providing raw data inputs. These inputs are then semantically encoded via the RDF, using triples that consist of subjects, predicates, and objects, alongside URIs that ensure the unique identification and integration of data entities across the KG. At the heart of the semantic structure are ontologies, exemplified here by the Unified Medical Language System (UMLS), which define the schema for the KG by outlining the essential relationships and attributes of the domain-specific entities. This ontology-based schema informs the organization and representation of knowledge within GDBs, such as Neo4j, which are specialized for storing and operationalizing the complex relational data of KGs. The central round-edged box showcases the role of query languages, with Cypher portrayed as a model for extracting information from GDBs through its intuitive syntax and pattern matching capabilities. The graphic elucidation of the query output illustrates a network of nodes and edges, representing the intricate interrelations and potential analytical insights derived from KGs. Each cluster within the network, designated as A, B, and C, symbolizes distinct subsets or aspects of the graph database that have been queried.

1.3 Overview of the Relationship between Knowledge Graphs and Feature Selection

The integration of KGs with feature selection processes marks a pivotal advancement in the realm of ML, particularly enhancing the capabilities of predictive models. Notably, many AI/ML systems remain largely unaware of domain-specific knowledge, such as biomedical information, which humans routinely leverage to solve complex problems. This oversight highlights the potential of KGs, with their rich web of entities, attributes, and interconnections, to bridge this gap. KGs play a crucial role across diverse domains such as the Semantic Web, NLP, and comprehensive data integration efforts, providing a structured representation that significantly aids in the precision of feature selection. This critical phase in ML aims at pinpointing the most relevant data attributes to optimize model performance, reduce over-fitting, and enhance interpretability.

However, integrating KGs into feature selection is challenging, encompassing issues of scalability, KG integrity, and adaptation to diverse domains. These challenges call for a concerted research effort aimed at developing scalable algorithms that efficiently navigate expansive KGs, enhance KG completeness, and foster the integration of varied data sources. This multidisciplinary arena benefits immensely from the combined expertise in knowledge representation, ML, and domain-specific areas, underscoring the critical need for a harmonious blend of structured knowledge with empirical, data-driven approaches.

To this end, exploring innovative methodologies becomes paramount. Approaches such as embedding-based feature selection and the application of graph neural networks (GNNs) demonstrate the potential of leveraging KGs’ unique characteristics for feature selection. These methodologies offer scalable and effective solutions for managing the high-dimensional spaces inherent to KGs, thus facilitating a more nuanced and comprehensive analysis of data.

Moreover, the dynamic nature of KGs, with their constantly evolving entities and relationships, necessitates feature selection methods that are not only adaptive but also capable of real-time updates. This adaptability ensures the relevance and efficacy of selected features in the face of new information, thereby maintaining the integrity and applicability of ML models in rapidly changing scenarios.

2 Background and Key Concepts

2.1 Definition and Structure of Knowledge Graphs

KGs categorize and link data for domain-specific knowledge discovery.

2.1.1 Ontologies

KGs use ontologies to define relationships and model semantics (staab2010handbook, ). Ontologies categorize concepts, allowing flexible queries. Bio2RDF, for example, defines classes like “proteins” and “chemical entities,” and their relationships using RDF triples.

2.1.2 Example: Bio2RDF

Bio2RDF integrates datasets like DrugBank (wishart2018drugbank, ), SIDER (kuhn2016sider, ), and KEGG (kanehisa2000kegg, ) into a unified RDF structure, enhancing data interoperability and supporting complex queries.

  • Nodes: Tagged with URIs, representing biomedical entities like genes and drugs.

  • Relationships: Include “targets” and “is affected by,” illustrating drug-protein interactions and genetic influences.

2.2 Structuring Domain Knowledge with RDF

2.2.1 RDF

RDF provides a structure for semantic representation in KGs (bodenreider2004unified, ). It formalizes relationships as triplets (subject-predicate-object) forming a graph $G=\{(s,p,o)\}$. RDF enhances data interlinking and queryability (donnelly2006snomed, ; nelson2011normalized, ).
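As a minimal illustration of the triple-set view $G=\{(s,p,o)\}$, the sketch below stores a few triples in a Python set and answers a pattern query in which `None` acts as a wildcard. The predicate `is_a` and the helper `match` are hypothetical names introduced for illustration; this is not the API of any RDF library, and only the Cyclophosphamide triple comes from the text of this review.

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Toy graph G = {(s, p, o)}; the "is_a" triples are illustrative assumptions.
G: Set[Triple] = {
    ("Cyclophosphamide", "treats", "Cancer"),
    ("Cyclophosphamide", "is_a", "Drug"),
    ("Cancer", "is_a", "Disease"),
}


def match(graph: Set[Triple],
          s: Optional[str] = None,
          p: Optional[str] = None,
          o: Optional[str] = None):
    """Return the sorted triples matching the pattern; None is a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All "treats" edges, then everything known about one subject.
print(match(G, p="treats"))
print(match(G, s="Cyclophosphamide"))
```

Query languages such as SPARQL generalize this kind of pattern matching to joins over several triple patterns.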

2.2.2 Ontologies

Ontologies in KGs categorize and describe concepts with flexible relationships. They enhance querying capabilities by defining both specific and abstract relationships, as seen in Bio2RDF and AlzKB.

2.3 Leveraging Graph Databases

Graph Databases (GDBs) like Neo4j manage complex data relationships within KGs, enabling efficient semantic analysis (miller2013graph, ). Freebase and query languages like Cypher and SPARQL extend GDB functionality for intuitive querying (bollacker2008freebase, ; francis2018cypher, ).

2.4 Visual Demonstration of ADKGs of Varying Sizes

We use AlzKB, an Alzheimer’s Disease KG, to demonstrate varying KG sizes. Figures represent tiny (Cypher query limit 8), small (Cypher query limit 15), and medium (Cypher query limit 200) KGs. A tiny KG example is shown in Figure 2.

Figure 2: A Tiny-sized ADKG (Yellow Node: AD; Purple Nodes: Genes; Green Nodes: Drugs) (alzheimersknowledgebase, ). There are five instances of the “Chemical binds gene” relationship (light purple arrows), where a chemical is shown to interact directly with a gene; six instances of the “Gene associates with disease” relationship (yellow arrows), representing genes that have an association with AD; one instance of the “Chemical decreases expression” relationship (dark green arrow), indicating a chemical that downregulates or decreases the expression of a gene; and one instance of “Gene regulates gene” (purple arrow), suggesting a regulatory interaction between two genes, PPARG and TPI1. More detailed information on genes and drugs is given in Appendix B.

3 Feature Selection on Knowledge Graphs

In this section, we categorize and evaluate the methodological frameworks described in the referenced manuscripts. Below we elaborate on four families of KG feature selection methods found, to the best of our knowledge, in the current literature: search algorithms, similarity-based methods, vector embeddings, and advanced network representation learning.

3.1 Causal Discovery-Search Algorithm

The goal of causal discovery is to move beyond merely describing correlated events to identifying the direction of influence between observed phenomena. The challenge in causality analysis lies in capturing the complex interactions between variables. Typically, these relationships are formalized using causal graphs, where nodes represent variables and directed edges denote causal effects.

In medicine, randomized controlled trials (RCTs) are the gold standard for establishing causal relationships and for characterizing phenomena such as confounding, colliders, mediation, moderation, reverse causality, effect modification, and causal chains. However, various analytical methods can infer causal relationships from observational data. When using them, researchers must consider other measured or unmeasured variables that may act as confounders, mediators, or colliders. For a comprehensive review of causal discovery, we recommend the survey by Zanga et al. zanga2022survey .

Considerable recent work has focused on building automated methods, generally based on natural language processing (NLP) techniques, to extract causal relations from the scientific literature and assemble them into KGs. These KGs can then be used to consolidate knowledge and form inferences and hypotheses about how different variables interact. Causal analysis can then identify features that have causal effects on downstream variables.

The study by Malec et al. malec2023causal introduced a novel causal feature selection framework using the “ADKG” knowledge graph. This ADKG was constructed from post-2010 PubMed biomedical literature and an ontology-grounded KG via the PheKnowLator workflow (callahan2019pheknowlator, ). The authors used PubMed identifiers and machine reading systems like EIDOS, REACH, and SemRep within the INDRA ecosystem (gyori2017word, ) to extract data. INDRA assembles knowledge into a model of causal molecular interactions (perez2009semantics, ), resulting in an OWL ontology (horrocks2005owl, ).

The study aimed to enhance causal feature selection with the ADKG. Hygiene steps were performed, and logical entailments were initially omitted. Predicates were mapped to the Relation Ontology (RO) to provide logical definitions and infer additional knowledge. Forward-chaining inference using CLIPS generated new triples based on RO properties, with belief scores assigned. Integration with PheKnowLator facilitated path search algorithms, reweighting edges with hierarchical relationships for optimized path searches. Competency questions, such as causal relationships between depression and AD, were addressed using SPARQL queries and Dijkstra’s shortest path algorithm (perez2009semantics, ; ducharme2013learning, ).

Dijkstra’s algorithm applied to the ADKG identified shortest paths connecting genes and diseases, highlighting direct relationships (malec2023causal, ). These paths were analyzed to identify potential confounders, colliders, and mediators. Confounders influence both exposure and outcome, colliders are influenced by both, and mediators act as intermediaries. Figure 3 illustrates identifying a potential confounder between AD and depression using Dijkstra’s algorithm. The study identified 126 unique potential confounders, 29 colliders, and 18 potential mediators, showcasing the ADKG’s ability to uncover intricate relationships that traditional searches might miss.

Figure 3: Illustration of Inflammatory Response (pink node) as a Potential Confounder in the Association Between AD (left yellow node) and Depression (right yellow node). The diagram represents the shortest paths (through orange nodes) identified by Dijkstra’s algorithm. The two green paths also connect inflammatory response with AD and Depression but both of them are one unit longer than the orange ones. Consequently, Dijkstra’s algorithm picks the shortest path.
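To make the path-search idea concrete, the following sketch runs Dijkstra’s algorithm on a small directed graph and flags a node as a potential confounder when it has a directed path to both the exposure and the outcome. The toy graph, its unit edge weights, the node names other than those shown in Figure 3, and the helper `potential_confounders` are assumptions made for illustration. This is a crude reachability heuristic, not the pipeline of Malec et al.

```python
import heapq


def dijkstra(graph, source):
    """Shortest-path distances from source in a weighted directed graph."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical causal edges with unit weights (adjacency dict of dicts).
toy = {
    "inflammatory_response": {"intermediate_a": 1.0, "intermediate_b": 1.0},
    "intermediate_a": {"alzheimers_disease": 1.0},
    "intermediate_b": {"depression": 1.0},
    "depression": {"sleep_disturbance": 1.0},
    "alzheimers_disease": {"sleep_disturbance": 1.0},
}


def potential_confounders(graph, exposure, outcome):
    """Map each node that reaches both exposure and outcome to its total path length."""
    found = {}
    for node in graph:
        if node in (exposure, outcome):
            continue
        dist = dijkstra(graph, node)
        if exposure in dist and outcome in dist:
            found[node] = dist[exposure] + dist[outcome]
    return found


print(potential_confounders(toy, "depression", "alzheimers_disease"))
```

Reversing the edge direction in the same search would flag candidate colliders (nodes reachable from both variables), while nodes lying on a directed path from exposure to outcome are candidate mediators.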

3.2 Feature Selection-Dimensionality Reduction

KGs can be utilized to perform feature selection for high-dimensional tabular datasets. In this scenario, nodes in the graph may relate to the columns, or features, of the tabular dataset. The selected subsets of features can then be analyzed with methods such as machine learning. Below, we outline a few examples of graph-based methods for selecting subsets of features.

  • Fang et al. fang2019diagnosis developed an information theory approach informed by a KG to select features for training machine learning models. The goal of their study was to develop a predictive model of Chronic Obstructive Pulmonary Disease (COPD) from a tabular dataset including twenty-eight features representing medical tests and patient symptoms. First, a KG was constructed by integrating electronic medical records (EMRs) and domain-specific biomedical knowledge to identify and represent relationships among diseases, symptoms, causes, risk factors, drugs, side effects, and more. The features of the tabular dataset corresponded to nodes in the KG. Their algorithm, CMFS-$\eta$, uses the weights between features in the KG to iteratively add or remove features from the set according to an information-theory-based heuristic. The study used this approach to select subsets of the corresponding features of the tabular dataset to train an SVM model.

  • Ma et al. ma2020knowledge sought to develop a model to predict whether a given Android app contained malware based on the Android API calls contained in the source code. First, they used the official documentation to construct a KG containing all API entities, such as classes and methods, as well as relationships between entities, such as return types and inheritance. Next, they identified, for each API entity, the highly sensitive permissions it requires. The study created a binary feature vector for each application based on whether or not a given entity was present in the code. To reduce the size of the binary feature vector, the authors selected only entities that were within one to four hops of a node requiring a sensitive permission. As not all entities contained explicit links in the documentation, an LSTM model was used to identify an additional subset of entities that shared similar descriptions with entities that require sensitive permissions. This feature vector could then be used to train a classification model. A detailed description of how sensitive APIs, which are nodes in the KG, are selected is shown in Figure 4.

    Figure 4: Example of Direct and Indirect Dangerous APIs Selection Enabled by the Android API KG. The golden-orange rounded rectangle in the figure signifies a dangerous API called “getCallCapablePhoneAccounts,” which facilitates the retrieval of Phone Account Handles for making and receiving calls. The light-yellow rounded rectangles are APIs directly connected to the Dangerous API, up to four degrees of separation through hyperlinks, with the understanding that links beyond this do not markedly enhance classification accuracy. The Siamese-BiLSTM network comes into play by identifying indirectly connected, potentially dangerous APIs (represented by the red rounded rectangle), such as “READ SMS,” which allows reading SMS messages but lacks a direct hyperlink or descriptive connection to other APIs. By embedding API descriptions into a vector space using Word2Vec and processing them through a Bidirectional LSTM, the network encodes the APIs’ textual data from both directions for a full context capture. These encoded vectors are then condensed through a dense layer into a final representation. Comparing these representations enables the network to detect hidden APIs that, while not directly linked, share sensitive characteristics with the known dangerous API, thereby revealing hidden dangers through textual similarity rather than explicit interlinking.
  • Jaworsky et al. jaworsky2023interrelated developed an unsupervised approach for selecting significantly interrelated features and eliminating redundant features from a KG, which they applied to a health survey dataset published by the Behavioral Risk Factor Surveillance System (BRFSS) (centers2014behavioral, ). The algorithm works by iteratively ranking features according to a weight based on their connections. This feature selection approach is divided into four main steps outlined in Figure 5.

    Figure 5: Illustration of the interrelated feature selection procedure. 1) In the data filtering step shown in part (a), states lacking lung cancer cases are excluded after referencing previous surveys spanning several years. 2) Features with over 50% missing values are eliminated, and a KG is constructed from the remaining features. 3) A KG-driven algorithm is used to transform the health survey question list into a data set with significantly interrelated features. 4) Finally, a binary relevance classifier (a special case of a multi-label classifier) is proposed to predict the likelihood of multiple diseases by identifying 1-to-many cancer relationships. In part (b), the KG-driven algorithm starts with an initial threshold of 100%. It then loops through the existing features and computes a weight for each (features with more edges receive more weight). After sorting the weights, the features with the highest weights are kept and the threshold is reduced by 1%. The algorithm iterates until the stopping criterion is met.
  • In the Hadith Corpus KG created by Mohammed et al. atef2022feature , nodes represent distinct features and semantic categories derived from Hadith texts. Features include specific Islamic terms like “prayer” or “fasting,” while categories encompass broader thematic areas like rituals, ethics, jurisprudence, and other domains of Islamic scholarship. Edges in this KG quantify associations between features and categories based on co-occurrence frequency.

    Figure 6: A demonstration of simplified ACO (Ant Colony Optimization) feature selection on Hadith Corpus KG. Here, two ants named Ben and Joe traverse the KG, with Ben starting at the “Zakat” node and moving to “Fasting” across iterations, and Joe beginning his journey at a randomly selected node “Sawm”. The pheromone and heuristic values, represented by the green and red numbers above and below the edges, are aggregated outcomes of the explorations conducted by all ants in the system. Parameters $\alpha$ and $\beta$ determine the relative influence of pheromone trails and heuristic information, respectively, while the evaporation rate $\rho$ ensures flexibility in pathfinding, preventing premature convergence on suboptimal routes. The collective pheromone deposit $\Delta$ between “Fasting” and “Sawm” by Ben and Joe is a cumulative measure reflecting the alignment of the Hadith content with specific categories, denoted by the pink nodes. The probability that Ben chooses “Sawm” as the next feature is computed as a normalized version of $\text{Pheromone}^{\alpha}\times\text{Heuristics}^{\beta}$ (see the middle right of the figure). In this instance, the focus is to reinforce the linkage between fasting-related Hadiths and the “Physical Acts” category, differentiating it from the “Spiritual Practice” category and the “Forms of Worship” category, which are more aligned with spiritual benefits and devotional acts.

    Feature selection for text classification is guided by Ant Colony Optimization (ACO) (dorigo2005ant, ; dorigo2006ant, ; dorigo2019ant, ). ACO is a probabilistic technique for solving computational problems that can be reduced to finding good paths through graphs. Inspired by the behavior of ants finding the shortest path from their colony to food sources, ACO is a swarm intelligence method and is commonly grouped with evolutionary algorithms. Initially, several paths are randomly constructed; after traversing a path, an ant deposits pheromones along it (typically in an amount inversely proportional to path length), so shorter paths receive more pheromones. Over time, the pheromones evaporate, reducing their attractive strength to prevent premature convergence. When choosing their paths, ants probabilistically prefer paths with stronger pheromone concentrations while also exploring new paths to avoid local optima. The process is repeated until convergence. In this way, ACO balances exploring new feature paths (exploration) against intensifying the search around promising features found in previous iterations (exploitation), adapting dynamically to find good feature sets for text classification (parpinelli2002data, ; martens2007classification, ; aghdam2009text, ; onan2023srl, ). The pheromone trail and a PageRank-like heuristic measure guide this optimization. We provide a graphical illustration of the ACO feature selection process in Figure 6.

    This study demonstrates that integrating ACO into Arabic text classification yields a notable 3% average increase in accuracy, F1 score, recall, and precision compared to conventional methods like Naive Bayes, Random Forest, Decision Trees, and XGBoost, contributing significantly to the field of Arabic text classification.
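The pheromone-guided choice in the ACO example can be sketched in a few lines. The code below implements the standard ACO transition rule, in which the probability of moving to a node is proportional to $\text{Pheromone}^{\alpha}\times\text{Heuristics}^{\beta}$, followed by an evaporate-then-deposit pheromone update with rate $\rho$. The numeric values, the default parameters, and the function names are illustrative assumptions; the node labels are borrowed from Figure 6, and the heuristic here is a plain number rather than the PageRank-like measure used in the study.

```python
def transition_probabilities(pheromone, heuristic, alpha=1.0, beta=2.0):
    """Normalize pheromone**alpha * heuristic**beta over the candidate next nodes."""
    scores = {
        node: (pheromone[node] ** alpha) * (heuristic[node] ** beta)
        for node in pheromone
    }
    total = sum(scores.values())
    return {node: score / total for node, score in scores.items()}


def update_pheromone(pheromone, deposits, rho=0.1):
    """Evaporate every trail by rate rho, then add the ants' deposits."""
    return {
        node: (1.0 - rho) * tau + deposits.get(node, 0.0)
        for node, tau in pheromone.items()
    }


# Candidate edges leaving the "Fasting" node; all numbers are made up.
pheromone = {"Sawm": 0.8, "Zakat": 0.2}
heuristic = {"Sawm": 0.5, "Zakat": 0.5}

probs = transition_probabilities(pheromone, heuristic)
print(probs)

# Suppose the ants deposited pheromone only on the edge to "Sawm".
pheromone = update_pheromone(pheromone, {"Sawm": 0.3})
print(pheromone)
```

With equal heuristic values the choice is driven entirely by the pheromone levels. Raising `beta` shifts weight toward the heuristic, and a larger `rho` makes the colony forget earlier paths faster.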

3.3 Data Linking and Data Integration-Similarity Based Methods

  • Data linkage and data integration refer to the process of combining different sources of data (chang2018making, ). Because KGs are developed to summarize large amounts of data, they are convenient tools for adding data and context to ML workflows. For example, the features of a given dataset can be expanded to include additional information per sample based on what is known about a given feature. Li et al. li2020feature collected data on self-reported student anxiety levels as well as basic information such as age, gender, grade, and home address. They then used the “Own-Think KG” (see Figure 7) as well as “DBpedia,” both known for their credibility and encyclopedic nature, to identify other features for their analysis based on the home address, including weather, population size, and GDP at both the district and regional area levels. These KGs follow a clear and explainable three-tuple storage structure, consisting of entities, attributes, and values, making them suitable for non-numerical feature generation. Importantly, they offer online querying capabilities, eliminating the need to download extensive datasets (auer2007dbpedia, ).
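The enrichment step can be sketched as a lookup over (entity, attribute, value) triples. In the example below, the toy KG, the column names, and the `enrich` helper are hypothetical. A real workflow would query Own-Think or DBpedia online rather than use an in-memory list, and the attributes shown are placeholders rather than the features used by Li et al.

```python
# Toy KG stored as (entity, attribute, value) triples; contents are placeholders.
KG = [
    ("Shanghai", "region_type", "municipality"),
    ("Shanghai", "climate", "humid subtropical"),
    ("Beijing", "region_type", "municipality"),
]


def enrich(records, kg, key, attributes):
    """Append the selected KG attributes of record[key] to each record as new columns."""
    lookup = {}
    for entity, attribute, value in kg:
        lookup.setdefault(entity, {})[attribute] = value
    enriched = []
    for record in records:
        facts = lookup.get(record[key], {})
        # Attributes missing from the KG are filled with None.
        extra = {attribute: facts.get(attribute) for attribute in attributes}
        enriched.append({**record, **extra})
    return enriched


students = [
    {"age": 16, "home_city": "Shanghai"},
    {"age": 17, "home_city": "Beijing"},
]

rows = enrich(students, KG, "home_city", ["region_type", "climate"])
for row in rows:
    print(row)
```

Selecting which attributes to pull in (the `attributes` argument) is where feature selection enters: only KG properties judged relevant to the outcome are added as columns.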

3.4 Knowledge Graph Embeddings-Vector Embeddings

The embedding-focused approach to feature selection, exemplified by methods such as DistMult (yang2014embedding, ), ComplEx (trouillon2016complex, ), TransE (bordes2013translating, ), RESCAL (nickel2011three, ), FeaBI (ismaeil2023feabi, ), and RippleNet (wang2018ripplenet, ), seeks to represent nodes in a continuous vector space that captures deep semantic relationships and properties. The concept is similar to word embeddings: whereas similar word vectors capture similar semantic meaning, graph node embeddings capture relationship similarity within the graph. The approach is popular for various applications, including link prediction (kumar2020link, ) and entity classification (al2020named, ). Link prediction serves several purposes, from selecting movies a user would be interested in to predicting drug-target interactions. Several methods have been developed to leverage embeddings for recommendation algorithms.

Figure 7: Own-Think KG Advantage over Traditional One-hot Encoding. Consider a dataset that includes information about various cities (Beijing, Shanghai, and Hong Kong), where each city is represented by non-numerical discrete features such as its name. In a traditional dataset, this name might be converted into a numerical form using techniques like one-hot encoding. However, this process strips the city’s name of any contextual information about the city itself. Using a KG like the Own-Think KG, we can query additional information about each city to enrich the features, such as geographical, economic, demographic, and cultural features.
Figure 8: Demonstration of Non-numeric Discrete Features Enrichment and Selection by Own-Think KG. The figure includes enriched information for Beijing, Hong Kong, and Shanghai. For example, the additional features for Shanghai provided by the Own-Think KG (see Figure 7) detail Shanghai’s population size, average temperature, latitude, longitude, and GDP. These features contribute to a richer, more nuanced profile of Shanghai than a one-hot encoding of each city, and offer additional insight into how each aspect of a city may relate to the analysis at hand.

Embedding via DistMult:

  1. The DistMult method, designed to predict missing relationships or facts within a KG (chen2020knowledge, ), embeds entities and their interactions as vectors, inherently performing feature selection by:

    • Capturing Semantic Similarities: Entities that interact in similar ways within the KG are embedded close together, emphasizing the features underlying these semantic similarities.

    • Highlighting Relevant Interactions: DistMult accentuates features defining the interactions, such as biological pathways or chemical properties relevant to the interaction.

  2. Optimization of Feature Representation: The DistMult training process fine-tunes the entity and relation representations in the vector space, adjusting the significance of various attributes to enhance model accuracy.
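
To make the scoring step concrete, the sketch below implements the DistMult trilinear score, in which a triple (head, relation, tail) is scored by summing the element-wise product of the three vectors. Each coordinate of the relation vector therefore acts as a weight on the corresponding embedding dimension. The entity names, the three-dimensional vectors, and the helper `rank_tails` are hypothetical. In practice the embeddings are learned during training rather than set by hand.

```python
def distmult_score(head, relation, tail):
    """DistMult trilinear score: sum_i head[i] * relation[i] * tail[i]."""
    return sum(h * r * t for h, r, t in zip(head, relation, tail))


def rank_tails(head, relation, candidates):
    """Return candidate tail names sorted by descending DistMult score."""
    return sorted(
        candidates,
        key=lambda name: distmult_score(head, relation, candidates[name]),
        reverse=True,
    )


# Hand-set toy embeddings (hypothetical; real ones come from training).
drug = [1.0, 0.5, 0.0]
binds = [1.0, 1.0, 0.0]  # a zero weight switches the third dimension off
genes = {
    "gene_x": [0.9, 0.4, 0.3],
    "gene_y": [-0.8, 0.1, 0.9],
}

ranking = rank_tails(drug, binds, genes)
print(ranking)
print(distmult_score(drug, binds, genes["gene_x"]))
```

Because the score is a plain product of coordinates, it is symmetric in head and tail. This is a known limitation of DistMult for directed relations.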

  • One relatively simple strategy for edge prediction is to first create embeddings for each node and then train a classification algorithm to predict whether or not a connection exists between two nodes given their embeddings. For example, Wang et al. wang2022kg utilized this strategy to predict drug-target interactions. In this study, the authors created node embeddings from a KG that contained known drug-target interactions. Next, they trained a deep learning model that took in a pair of embeddings (one drug and one target) to predict whether or not this pair was an existing edge in the graph. The authors showed that the model was able to identify some known interactions that were removed from the training set.

  • A unique example comes from Wang et al., who proposed a hybrid KG embedding and path-based method in a recommendation algorithm they named RippleNet (wang2018ripplenet, ). In this context, the KG contains nodes representing items that can be recommended, for example, movies, along with other nodes representing features associated with each item, such as actors, genres, or release date. Edges represent associations between items and features, for example, a movie and its actors. In addition, there is a separate matrix that contains the interactions between each user and item. The goal is to predict the likelihood of a user selecting an item given the KG and the user’s prior interactions. The algorithm begins by initializing the representation of each item based on the user’s click history. Next, the algorithm iterates over items at an increasing number of hops from the items the user has already interacted with. The end result is an embedding of each item’s relevance, which a model combines with the initial vector representation to predict the likelihood of the user selecting that item. This was later extended by Wang et al. wang2021multitask with a combined deep framework, Ripp-MKR, that is trained simultaneously on a KG embedding task and the recommendation task. The model architecture shares latent features between the two tasks, with the idea being that the inclusion of the embedding task will enhance the latent representations. We give an illustration of Ripp-MKR in Figure 9.

    Figure 9: Illustration of Ripp-MKR Feature Learning Mechanisms. The Ripp-MKR model involves a recommendation system KG with nodes representing users, movies, genres, and actors. In this KG, relationships such as “Alice watched Barbie of Swan Lake,” “Barbie of Swan Lake is starred by Barbie,” and “Barbie of Swan Lake has genre Animation” are examples of how the system is structured (see KG Construction). Taking Alice as the initial point, we construct a historical set, $V_{\text{Alice}}$, comprising Alice’s movie-watching history, which includes three movies (see RippleNet). RippleNet then extends Alice’s preference for the Barbie series to other movies with similar genres and actors, like “Barbie.” The KG Embedding Module (KGE) refines Alice’s embedding, $u_{\text{Alice}}$, by aggregating all the $k$-hop softmax-weighted tail embeddings $t_i$, for instance, “Animation” and “Barbie” (see KG Construction). This refined embedding, $u_{\text{Alice}}$, is processed through an $L$-layer MLP to derive a nuanced user vector, $U_{\text{Alice}}^{L}$. The KGE is informed by the interactions among movies, genres, and actors. The Cross and Compress Unit (CCU) examines the interactions between different genres by calculating the outer product of the movie vector $v$ (e.g., “Skyfall”) and an entity vector $e$ from the set $S_{\text{Skyfall}}$, which includes entities related to “Skyfall” in the KG. After performing the outer product between $v$ and each $e \in S_{\text{Skyfall}}$ $L$ times, the final latent feature vector, $V_{\text{Skyfall}}^{L}$, for “Skyfall” is obtained by taking the expectation over the $L$ outer products. The Recommendation Module then selects the movie with the highest sigmoid probability from the inner product of $U_{\text{Alice}}^{L}$ and $V_{\text{Skyfall}}^{L}$, denoted by $\widehat{y}_{\text{Alice,Skyfall}}$. From potential next movies like “Skyfall,” “Inception,” and “Barbie: Fairytopia,” Ripp-MKR recommends “Barbie: Fairytopia” to Alice as it has the highest probability value, indicating it as the most suitable next watch.
  • Ismaeil et al. ismaeil2023feabi introduced a method, FeaBI, to generate interpretable KG entity embeddings. First, a standard KG embedding is calculated. Additionally, a few categories of features for each node are extracted to form a vector, including the types of edges or relations it has, the types of nodes it is connected to, sequences of edge types of a certain length, and graph structural statistics. Next, regression random forest models are trained to predict each of the original embedding dimensions from its extracted feature vector. The random forest model ranks features based on their importance for the reconstruction task. These rankings can be used to better understand the information captured by embeddings. Additionally, a smaller subset of the feature vector can be selected for the most important features and used in place of the original embedding for more interpretable analysis.
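
The hop-wise expansion that RippleNet performs from a user’s click history can be sketched as follows. The snippet only builds the sets of triples reached at each hop. It omits the softmax weighting of tail embeddings and the final prediction model, so it should be read as an illustration of the traversal rather than of the full algorithm. The function name `ripple_sets`, the relation labels, and the edges that lead from features back to other movies are hypothetical.

```python
def ripple_sets(triples, seed_items, n_hops):
    """Return [S_1, ..., S_n], where S_k holds the triples whose head
    entity was reached at hop k-1 (hop 0 is the user's click history)."""
    by_head = {}
    for head, relation, tail in triples:
        by_head.setdefault(head, []).append((head, relation, tail))

    frontier = set(seed_items)
    hops = []
    for _ in range(n_hops):
        # sorted() only makes the output order deterministic
        hop = [t for head in sorted(frontier) for t in by_head.get(head, [])]
        hops.append(hop)
        frontier = {tail for _, _, tail in hop}
    return hops


# Toy KG; relation names and feature-to-movie edges are illustrative.
toy_kg = [
    ("Barbie of Swan Lake", "has_genre", "Animation"),
    ("Barbie of Swan Lake", "starred_by", "Barbie"),
    ("Animation", "genre_of", "Barbie: Fairytopia"),
    ("Barbie", "stars_in", "Barbie: Fairytopia"),
    ("Skyfall", "has_genre", "Action"),
]

hops = ripple_sets(toy_kg, {"Barbie of Swan Lake"}, n_hops=2)
for k, hop in enumerate(hops, start=1):
    print(f"hop {k}: {hop}")
```

In this toy graph, the second hop reaches “Barbie: Fairytopia” through both the genre and the actor, while “Skyfall” is never reached from the seed item.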

3.5 Deep Learning-Advanced Network Representation Learning

Deep Learning models are designed to capture high-level, abstract representations of data. This ability allows them to capture meaningful insights from KGs, thereby enhancing applications in various domains, including personalized recommendations and predictive healthcare analytics.

  • Anelli et al. anelli2021sparse propose KGFlex, a recommendation system (shani2011evaluating, ) that integrates KG-based feature selection to improve the personalization and accuracy of recommendations. They use the notion of multi-hop predicates (zhang2021cone, ) (i.e., chains of predicates that connect two entities at a greater depth) to construct semantic features on a KG. For instance, $\text{A} \rightarrow \text{B} \rightarrow \text{C}$ is a 2-hop predicate.

    In the feature selection step, KGFlex utilizes the concepts of entropy and information gain (shannon1948mathematical, ; rokach2005top, ) to assess how significant and relevant a feature is to a user when determining whether to engage with an item or not, i.e., to watch a movie or not. The features, represented as $\langle \text{predicate}, \text{entity} \rangle$ pairs, are then embedded in a latent space to construct the user-item interaction along with user embeddings via DL methods. For a particular user, the items with higher user-item interactions are recommended. All the embeddings and model parameters in KGFlex are learned from the Bayesian Personalized Ranking (BPR) optimization criterion (rendle2012bpr, ). The whole procedure is visualized in Figure 10.

    Figure 10: Illustration of KGFlex feature selection and recommendation procedure. We start with a KG with 6 nodes and 6 predicates (edges/relations). A feature set $\mathcal{F}$ is constructed where each element is of form $\langle \text{predicate, node} \rangle$; for instance, from node A we can get to node B via a black predicate, then a feature is constructed as $\langle \text{Black, B} \rangle$. We construct a global embedding set $\mathcal{G}$ representing each feature in $\mathcal{F}$, and a user-feature embedding set $\mathcal{P}$ for each pair of user and feature. All embeddings and parameters in KGFlex are learned via DL methods with the BPR optimization criterion. We then associate each user-feature pair with an information gain $\mathbb{IG}$, which measures the expected reduction in information entropy from a prior node to a new node that acquires some information. For instance, suppose a user is currently at node $A$. The computed information gain $\mathbb{IG}(\langle \text{Black, B} \rangle)=1$, $\mathbb{IG}(\langle \text{Orange, E} \rangle)=0$ and $\mathbb{IG}(\langle \text{Black, D} \rangle)=1$ means the nodes $B, D$ and the predicate “Black” have influential impacts on the user’s next move. Finally, for each user, we compute the user-item interaction $\mathbb{X}$ and recommend items to him with the highest $\mathbb{X}$ values.

    The performance of KGFlex is evaluated on three datasets from different domains: Yahoo! Movies, MovieLens, and Facebook Books. The experiments are designed to test the performance of KGFlex in terms of the Gini Index (gastwirth1972estimation, ; castells2021novelty, ). KGFlex outperforms certain baseline models such as kaHFM (anelli2019make, ), Item-kNN (koren2010factor, ), NeuMF (he2017neural, ) and BPR-MF (rendle2012bpr, ) by an average of 18%. It also surpasses them on other key metrics, such as Item Coverage (adomavicius2011improving, ). Additionally, it excels in metrics like ACLT (abdollahpouri2019managing, ), PopREO, and PopRSP (zhu2020measuring, ), which measure recommendation performance concerning the underrepresentation of rare items. It is occasionally outperformed only by kaHFM in top-10 recommendations.

  • Su et al. su2022attention present an attention-based KG representation learning framework, named DDKG, aimed at feature representation and selection to improve drug-drug interaction (DDI) prediction. This approach allows for end-to-end prediction of DDIs. We summarize DDKG in the following four parts:

    a. KG Construction: The KG construction amalgamates the Simplified Molecular Input Line Entry System (SMILES), SMILES-associated triple facts, and entities such as proteins and diseases. For example, given two drugs, A and B, we integrate their SMILES sequences alongside their relationships (e.g., “targets”) with diseases into the KG.

    b. Drug Embedding Initialization: DDKG uses an encoder-decoder layer to learn the initial embeddings of drug nodes, mainly from the SMILES sequences in the KG. This step transforms the SMILES sequences of drugs A and B into vector representations that capture their chemical structure and properties.

    c. Drug Representation Learning: This part, consisting of three elements, serves as the key feature selection step in DDKG.

      - Neighborhood Sampling: For each drug node, a fixed-size set of neighboring nodes is selected. The significance of each neighbor is determined by attention weights, which are calculated from the embeddings of the nodes and the types of relationships among them. This step ensures that only the most relevant neighbors (in terms of both graph structure and drug relationships) are considered for further computation.

      - Information Propagation: This step calculates a weighted sum of the neighbor embeddings. The attention weights (calculated in the previous step) determine how much each neighbor’s information contributes to the drug node’s new representation, so that more relevant neighbors have a bigger impact on the final representation.

      - Information Aggregation: The weighted sum of the neighbor embeddings is combined with the drug node’s initial embedding to obtain a final global representation of the drug node.

    d. DDI Prediction: For a queried pair of drugs, DDKG estimates their interaction probability by simply multiplying their final respective representations derived in c.

  • In the work by Hsieh et al. hsieh2021drug , a Graph Neural Network (GNN) (zhou2020graph, ) is employed to advance the feature selection (drug selection) process for COVID-19 treatment from a drug-target interaction network (see Figure 11). The authors first constructed a COVID-19 KG (see the top-left region in Figure 11) and generated embeddings using a GNN. The method involves transferring knowledge from another drug repurposing KG (see top-right region) and learning high-dimensional drug embeddings that encapsulate complex pharmacological characteristics (see middle region). By utilizing a ranking model informed by Bayesian pairwise ranking loss, this approach prioritizes potential drug candidates for downstream tasks such as gene set enrichment analysis (see middle-left region) and retrospective in vitro drug screening (see middle-right region). The top 22 candidate drugs, including Aspirin, Acetaminophen, and Teicoplanin, are highlighted in the paper, demonstrating the rapid identification of candidate drugs for COVID-19 treatment.

    Figure 11: Feature selection (drug selection) via GNN embedding and drug ranking. The authors first construct a COVID-19 KG containing different types of nodes (including 3635 drugs) and interactions. A variational graph autoencoder with GraphSAGE message passing (kipf2016variational, ; hamilton2017inductive, ), a specific type of GNN, was used to derive the drug embeddings (the grey squares in Feature Selection), with knowledge transferred from a drug repurposing KG (DRKG) (zeng2020repurpose, ) to boost their representativeness. An initial drug ranking using Bayesian pairwise ranking loss is applied to rank and select potentially potent drugs out of all candidates, hence serving as a feature selection step. The model efficacy was demonstrated using different validations. For instance, the authors perform Genetic Validation by identifying significant associations between SARS-CoV-2 and selected drugs. Drug Screening Validation is also performed by retrospectively comparing selected drugs with effective drugs in various in vitro drug screening experiments. In the Population-based Validation, the proposed method identified six drugs administered to COVID-19 patients out of ten positive drugs that were effective in the electronic health records. In addition, a Drug Combination Search for improving COVID-19 treatment efficacy is conducted on the selected drugs. All validation results attest to the capability of the proposed method to speed up the discovery of candidate drugs for treating COVID-19.
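
Both KGFlex and the drug-ranking model above are trained with a pairwise ranking objective. As a minimal sketch, assuming the standard BPR formulation of Rendle et al. (rendle2012bpr, ) without its regularization term, the loss for a pair is $-\ln \sigma(s_{\text{pos}} - s_{\text{neg}})$, where $s_{\text{pos}}$ and $s_{\text{neg}}$ are the model scores of a preferred and a non-preferred candidate. The function name `bpr_loss` and the example scores are hypothetical. How each cited system computes its scores is not reproduced here.

```python
import math


def bpr_loss(pos_scores, neg_scores):
    """Mean pairwise ranking loss: -log(sigmoid(s_pos - s_neg)).

    Uses the identity -log(sigmoid(x)) = log(1 + exp(-x)).
    """
    losses = [
        math.log1p(math.exp(-(pos - neg)))
        for pos, neg in zip(pos_scores, neg_scores)
    ]
    return sum(losses) / len(losses)


# Hypothetical scores for two (preferred, non-preferred) candidate pairs.
well_ordered = bpr_loss([2.0, 1.5], [0.0, -0.5])  # preferred items score higher
mis_ordered = bpr_loss([0.0, -0.5], [2.0, 1.5])   # preferred items score lower
tie = bpr_loss([1.0], [1.0])                      # equal scores give log(2)

print(well_ordered, tie, mis_ordered)
```

The loss shrinks as preferred candidates are scored further above non-preferred ones. This is why the learned scores can be used directly to rank, and hence select, candidates.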

3.6 Comparative Analysis of Different Approaches

We evaluate the methodologies from referenced manuscripts, focusing on their advantages and disadvantages.

1. Search Algorithms: Used in the Hadith Corpus KG mohammed2020edge with the ACO algorithm and in COPD diagnosis fang2019diagnosis with the CMFS-$\eta$ algorithm. These methods highlight the importance of selecting appropriate strategies based on specific dataset requirements.

2. Vector Embeddings: This approach, exemplified by the DistMult algorithm and FeaBI, moves away from explicit path searches to embedding entities in a continuous vector space. It captures deep semantic relationships, facilitating the identification of intricate patterns relevant to complex domains like drug discovery dorigo2019ant ; atef2022feature .

3. Similarity-based Methods: These methods compare entities within a graph to identify similarities using metrics like cosine similarity or the Jaccard index. They are beneficial for clustering or recommendation systems, as demonstrated by Ma et al. ma2020knowledge in Android malware classification and Jaworsky et al. jaworsky2023interrelated in health survey datasets.

4. Advanced Network Representation Learning: This approach utilizes deep learning models to interpret and analyze KGs, capturing high-level data representations. Examples include KGFlex for optimizing recommendation systems and DDKG for drug-drug interaction predictions, showcasing the power of GNN frameworks in feature selection fang2019diagnosis .

Comparison and Contrast: Search algorithms and similarity-based methods provide direct, interpretable insights into KG structures, making them suitable for applications requiring clarity and precision. In contrast, vector embeddings and advanced network representation learning offer a more nuanced understanding of data, identifying complex patterns and relationships. These latter methods are valuable for scenarios where data relationships are not straightforward, enabling flexible and powerful KG modeling for predictive analytics. The drug ranking technique by Hsieh et al. hsieh2021drug demonstrates the intersection of vector embeddings and advanced network learning, highlighting their potential in feature selection.

Table 1: Comparison of Feature Selection Methods for KGs
\begin{tabular}{p{0.24\linewidth} p{0.36\linewidth} p{0.36\linewidth}}
\hline
Method & Pros & Cons \\
\hline
Search Algorithms & Efficient and precise in known domains. Straightforward implementation. & May miss novel connections. Less adaptive to new patterns. \\
Vector Embeddings & Captures deep semantic relationships. Scalable to large KGs. Enhances predictive power. & Challenges in interpretability. High initial training cost. \\
Similarity-based Methods & Easy to understand. Efficient for clustering/recommendations. & Reliant on similarity metric quality. Computational challenges with large KGs. \\
Advanced Network Representation Learning & Learns complex representations. Integrates heterogeneous data. Versatile in application. & Computationally intensive. Complex model structure. \\
\hline
\end{tabular}
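
For the similarity-based methods compared above, the two metrics named in the text can be computed as in the sketch below. Cosine similarity compares two numeric feature vectors by the angle between them, and the Jaccard index compares two sets (for instance, the neighbor sets of two KG entities) by the ratio of shared to total members. The example vectors, the set members, and the convention of returning 0.0 for empty or zero-length inputs are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard_index(a, b):
    """|A intersect B| / |A union B| (0.0 by convention if both are empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical neighbor sets of two KG entities.
neighbors_1 = {"g1", "g2", "g3"}
neighbors_2 = {"g2", "g3", "g4"}

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(jaccard_index(neighbors_1, neighbors_2))    # 2 shared out of 4 total
```

Both metrics are cheap for a single pair, but comparing all pairs of entities grows quadratically with the number of entities. This is the computational challenge noted for large KGs in Table 1.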

4 Challenges and Opportunities in Knowledge Graph Feature Selection

Knowledge Graphs (KGs) are transforming data-driven fields like biomedical research, bioinformatics, and recommendation systems. They offer significant analytical capabilities but also present challenges and opportunities, especially in feature selection for machine learning models.

4.1 Challenges

Feature selection in KGs faces several hurdles:

  1. High Dimensionality and Complexity: KGs encompass numerous entities and relationships, creating high-dimensional spaces that challenge traditional feature selection methods.

  2. Data Heterogeneity: KGs integrate diverse data types (numerical, categorical, textual) from various sources, necessitating robust feature selection techniques.

  3. Interpretability: Enhancing interpretability is crucial, especially in fields like healthcare, where understanding why features are selected is essential.

4.2 Future Directions

Several promising research avenues could redefine KG feature selection:

  • Causal Inference Techniques: Applying causal inference techniques to KGs can refine feature selection strategies (malec2023causal, ).

  • Embedding KGs into Feature Matrices: Creating feature matrices from KGs facilitates downstream tasks and enhances model performance (strande2017evaluating, ).

  • Novel Algorithms: Exploring algorithms like Ant Colony Optimization (ACO) introduces new approaches to feature selection within KGs (dorigo2019ant, ; atef2022feature, ).

  • Multi-objective Optimization: Using multi-objective optimization techniques offers a refined methodology for feature selection, balancing criteria like redundancy and relevance (mouret2015illuminating, ).

  • Interdisciplinary Integration: Combining KGs with quantum computing, reinforcement learning (RL), and federated learning (FL) can enhance feature selection. Quantum-enhanced selection addresses scalability, RL refines the process based on feedback, and FL enables decentralized selection, preserving privacy ma2021quantum ; huang2022fedcke .

  • Semantic Enrichment and XAI: Leveraging the semantic information in KGs and applying Explainable AI principles can improve feature selection and model interpretability.

  • Domain Knowledge Integration: Integrating domain-specific knowledge into the feature selection process results in more effective selections, particularly in specialized fields like genomics and pharmacology.

  • Multi-modal Data Fusion: Combining various data sources into KGs offers a holistic view and unlocks new insights and applications.

  • Dynamic KGs and Real-time Feature Selection: Developing methods for real-time feature selection as KGs evolve can lead to more agile models, critical in rapidly changing domains like social media analysis.

  • Collaborative KG Frameworks: Creating frameworks for sharing and integrating KGs can enhance feature diversity and quality, fostering standardized protocols and benchmarks.

  • Ethical Considerations: Prioritizing ethical considerations and bias mitigation in KG feature selection ensures fairness and equity in applications.

5 Conclusion

Examining these methodologies underscores the importance of scalability, accuracy, and interpretability in feature selection processes. As KGs grow, developing scalable algorithms that efficiently process large-scale KGs without losing information granularity is paramount. This requires a balanced approach that leverages KGs’ rich semantic relationships while addressing computational challenges.

Key Points of the Paper

  • Emphasizes combining feature selection techniques with KGs to enhance predictive modeling in biomedical research.

  • Shows significant applications in bioinformatics, improving disease prediction and drug discovery processes.

  • Discusses challenges like computational complexity and the need for comprehensive KGs, proposing future research to develop efficient algorithms and integrate additional data sources.

6 Funding

This work was funded by the National Institutes of Health (NIH) [U01 AG066833].

Appendix

6.1 Appendix A. Table of Acronyms

Table 2 lists the acronyms used in this paper.

Abbreviation Definition
ACLT Average Coverage of Long Tail items
ACO Ant Colony Optimization
AD Alzheimer’s Disease
ADKG Alzheimer’s Disease Knowledge Graph
AI Artificial Intelligence
AlzKb Alzheimer’s Disease Knowledge Base
APOE Apolipoprotein E
AUC Area Under the Curve
Bi-LSTM Bidirectional Long Short-Term Memory
BPR Bayesian Personalized Ranking
COPD Chronic Obstructive Pulmonary Disease
CYP2D6 Cytochrome P450 2D6
DDI drug-drug interaction
DistMult The Distributed Multinomial Method
DL Deep Learning
DR Dimensionality/Dimension Reduction
DSA-SVM Direct Search Simulated Annealing with Support Vector Machine
DTP Drug-target Pairs
GDB Graph Database
GNN Graph Neural Network
HMOX1 Heme Oxygenase 1
KEGG Kyoto Encyclopedia of Genes and Genomes
KG Knowledge Graph
LDA Linear Discriminant Analysis
LLE Locally Linear Embedding
ML Machine Learning
MLP Multilayer Perceptron
MQL Metaweb Query Language
MTHFR Methylenetetrahydrofolate Reductase
nDCG Normalized Discounted Cumulative Gain
NLP Natural Language Processing
NOS3 Nitric Oxide Synthase 3
OWL The Web Ontology Language
PCA Principal Component Analysis
PPARG Peroxisome Proliferator-Activated Receptor Gamma
RDF Resource Description Framework
RFE Recursive Feature Elimination
RO Relation Ontology
TPI1 Triosephosphate Isomerase 1
URIs Uniform Resource Identifiers
UMLS Unified Medical Language System
W3C World Wide Web Consortium
YAGO Yet Another Great Ontology
Table 2: Table of Acronyms

6.2 Appendix B. A more detailed description of KGs of sizes tiny, small, and medium

Within each of the three graphs (see Figure 2, Figure 12, and Figure 13), the nodes and their connections are represented by distinct colors and arrow types to convey different biological relationships:

Orange (see Figure 12 and Figure 13) and Yellow (see Figure 2) nodes represent the disease entity, with AD positioned as the central node, highlighting it as the primary focus of this network.

Purple nodes signify genes, which are implicated in AD through various associations such as genetic risk factors, differential gene expression, or other genetic interactions.

Green nodes denote chemicals, encompassing drugs, vitamins, or other bioactive molecules. These external agents are potential modulators of gene function or disease pathology.

  • There are five instances of the “Chemical binds gene” relationship (light purple arrows in Figure 2 and coffee arrows in Figure 12 and Figure 13), where a chemical is shown to interact directly with a gene. This does not necessarily indicate an increase or decrease in gene expression, but rather a physical or functional interaction. For example, one of the edges indicates folic acid, a form of vitamin B that is vital for making DNA and other genetic material, binds the MTHFR gene. MTHFR plays a crucial role in processing amino acids, the building blocks of proteins. Variants of this gene can affect homocysteine levels in the blood. Deficiencies in folic acid are linked to elevated homocysteine levels, which may increase AD risk.

  • There are six instances of the “Gene associates with disease” relationship (yellow arrows in Figure 2 and red arrows in Figure 12 and Figure 13), representing genes that have an association with AD. These relationships might represent genetic risk factors, genes involved in the pathology of the disease, or genes that could be potential targets for therapeutic intervention. For instance, the NOS3 gene is associated with AD. It is involved in the generation of nitric oxide, a molecule that aids in blood vessel dilation. Impairment in NOS3 function can affect blood flow in the brain, potentially impacting Alzheimer’s disease pathology.

  • There are three instances of the “Chemical increases expression” relationship (pink arrows), denoting chemicals that are known to upregulate or increase the expression of certain genes. For instance, Vitamin A increases the expression of HMOX1, a gene encoding an enzyme that is expressed in response to oxidative stress, which is a contributing factor in the neuronal damage observed in AD.

  • There is one instance of the “Chemical decreases expression” relationship (green arrow), indicating a chemical that downregulates or decreases the expression of a gene. Namely, Cyclosporine, an immunosuppressant that may inhibit the formation of amyloid plaques, a hallmark of AD, decreases the expression of TPI1, a gene encoding an enzyme that plays a crucial role in glycolysis, a metabolic pathway that occurs in the cytoplasm of cells.

  • There is one instance of “Gene regulates gene” (purple arrow), suggesting a regulatory interaction between two genes, PPARG and TPI1. For context, PPARG is a gene that codes for a protein that regulates fatty acid storage and glucose metabolism. It is a target for some drugs that might influence Alzheimer’s disease progression.

Figure 12 provides an example of a small-sized ADKG with 23 nodes and 32 edges (setting the Cypher limit clause to 15), and Figure 13 provides an example of a medium-sized ADKG with 156 nodes and 288 edges (setting the Cypher limit clause to 200). In addition to the relationship types described above, the medium-sized ADKG also demonstrates the “DRUGTREATDISEASE” (gold arrows) and “GENEINTERACTSWITHGENE” (brown arrows) relationships. As KGs continue to grow, comprehending the intricate web of entities and relationships within them becomes daunting for human observers. Consequently, there is an urgent need for sophisticated computational tools capable of effectively managing these vast KGs.

Figure 12: A Small-sized ADKG (Orange Node: AD; Purple Nodes: Genes; Green Nodes: Drugs) alzheimersknowledgebase
Figure 13: A Medium-sized ADKG (Orange Node: AD; Purple Nodes: Genes; Green Nodes: Drugs) alzheimersknowledgebase

References

  • [1] Janneth Chicaiza and Priscila Valdiviezo-Diaz. A comprehensive survey of knowledge graph-based recommender systems: Technologies, development, and contributions. Information, 12(6):232, 2021.
  • [2] François Belleau, Marc-Alexandre Nolin, Nicole Tourigny, Philippe Rigault, and Jean Morissette. Bio2rdf: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics, 41(5):706–716, 2008.
  • [3] SM Shamimul Hasan, Donna Rivera, Xiao-Cheng Wu, Eric B Durbin, J Blair Christian, and Georgia Tourassi. Knowledge graph-enabled cancer data analytics. IEEE journal of biomedical and health informatics, 24(7):1952–1967, 2020.
  • [4] Dieter Fensel, Umutcan Şimşek, Kevin Angele, Elwin Huaman, Elias Kärle, Oleksandra Panasiuk, Ioan Toma, Jürgen Umbrich, Alexander Wahler, Dieter Fensel, et al. Introduction: what is a knowledge graph? Knowledge graphs: Methodology, tools and selected use cases, pages 1–10, 2020.
  • [5] Stephen Bonner, Ian P Barrett, Cheng Ye, Rowan Swiers, Ola Engkvist, Charles Tapley Hoyt, and William L Hamilton. Understanding the performance of knowledge graph embeddings in drug discovery. Artificial Intelligence in the Life Sciences, 2:100036, 2022.
  • [6] Yang Yang, Yuwei Lu, and Wenying Yan. A comprehensive review on knowledge graphs for complex diseases. Briefings in Bioinformatics, 24(1):bbac543, 2023.
  • [7] Beth Levine and Guido Kroemer. Biological functions of autophagy genes: a disease perspective. Cell, 176(1):11–42, 2019.
  • [8] Payal Chandak, Kexin Huang, and Marinka Zitnik. Building a knowledge graph to enable precision medicine. Scientific Data, 10(1):67, 2023.
  • [9] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • [10] Artur J Ferreira and Mário AT Figueiredo. Efficient feature selection filters for high-dimensional data. Pattern recognition letters, 33(13):1794–1804, 2012.
  • [11] Salim Lahmiri. Features selection, data mining and finacial risk classification: a comparative study. Intelligent Systems in Accounting, Finance and Management, 23(4):265–275, 2016.
  • [12] Shamsul Huda, John Yearwood, Herbert F Jelinek, Mohammad Mehedi Hassan, Giancarlo Fortino, and Michael Buckland. A hybrid feature selection with ensemble classification for imbalanced healthcare data: A case study for brain tumor diagnosis. IEEE Access, 4:9145–9154, 2016.
  • [13] Malcolm R Forster. Key concepts in model selection: Performance and generalizability. Journal of mathematical psychology, 44(1):205–231, 2000.
  • [14] Pasi Saari, Tuomas Eerola, and Olivier Lartillot. Generalizability and simplicity as criteria in feature selection: Application to mood classification in music. IEEE Transactions on audio, speech, and language processing, 19(6):1802–1812, 2010.
  • [15] Thulasi Bikku, N Sambasiva Rao, and Ananda Rao Akepogu. Hadoop based feature selection and decision making models on big data. Indian Journal of Science and Technology, 9(10):1–6, 2016.
  • [16] Bushra Mohammed, Mosab Hamdan, Joseph Stephen Bassi, Haitham A Jamil, Suleman Khan, Abdallah Elhigazi, Danda B Rawat, Ismahani Binti Ismail, and Muhammad Nadzir Marsono. Edge computing intelligence using robust feature selection for network traffic classification in internet-of-things. IEEE Access, 8:224059–224070, 2020.
  • [17] Alan Jović, Karla Brkić, and Nikola Bogunović. A review of feature selection methods with applications. In 2015 38th international convention on information and communication technology, electronics and microelectronics (MIPRO), pages 1200–1205. IEEE, 2015.
  • [18] Steffen Staab and Rudi Studer. Handbook on ontologies. Springer Science & Business Media, 2010.
  • [19] David S Wishart, Yannick D Feunang, An C Guo, Elvis J Lo, Ana Marcu, Jason R Grant, Tanvir Sajed, Daniel Johnson, Carin Li, Zinat Sayeeda, et al. Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research, 46(D1):D1074–D1082, 2018.
  • [20] Michael Kuhn, Ivica Letunic, Lars Juhl Jensen, and Peer Bork. The sider database of drugs and side effects. Nucleic acids research, 44(D1):D1075–D1079, 2016.
  • [21] Minoru Kanehisa and Susumu Goto. Kegg: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28(1):27–30, 2000.
  • [22] Olivier Bodenreider. The unified medical language system (umls): integrating biomedical terminology. Nucleic acids research, 32(suppl_1):D267–D270, 2004.
  • [23] Kevin Donnelly et al. Snomed-ct: The advanced terminology and coding system for ehealth. Studies in health technology and informatics, 121:279, 2006.
  • [24] Stuart J Nelson, Kelly Zeng, John Kilbourne, Tammy Powell, and Robin Moore. Normalized names for clinical drugs: Rxnorm at 6 years. Journal of the American Medical Informatics Association, 18(4):441–448, 2011.
  • [25] Justin J Miller. Graph database applications and concepts with neo4j. In Proceedings of the southern association for information systems conference, Atlanta, GA, USA, volume 2324, pages 141–147, 2013.
  • [26] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250, 2008.
  • [27] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In Proceedings of the 2018 international conference on management of data, pages 1433–1445, 2018.
  • [28] Joseph D. Romano, Van Truong, Rachit Kumar, Mythreye Venkatesan, Britney E. Graham, Yun Hao, Nick Matsumoto, Xi Li, Zhiping Wang, Marylyn Ritchie, Li Shen, and Jason H. Moore. The alzheimer’s knowledge base - a knowledge graph for therapeutic discovery in alzheimer’s disease research. Preprint, February 2023.
  • [29] Alessio Zanga, Elif Ozkirimli, and Fabio Stella. A survey on causal discovery: Theory and practice. International Journal of Approximate Reasoning, 151:101–129, 2022.
  • [30] Scott A Malec, Sanya B Taneja, Steven M Albert, C Elizabeth Shaaban, Helmet T Karim, Arthur S Levine, Paul Munro, Tiffany J Callahan, and Richard D Boyce. Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: a use case studying depression as a risk factor for alzheimer’s disease. Journal of Biomedical Informatics, 142:104368, 2023.
  • [31] Tiffany J Callahan. PheKnowLator. https://doi.org/10.5281/zenodo.3401437, 2019. Available online at Zenodo.
  • [32] Benjamin M Gyori, John A Bachman, Kartik Subramanian, Jeremy L Muhlich, Lucian Galescu, and Peter K Sorger. From word models to executable models of signaling networks using automated assembly. Molecular systems biology, 13(11):954, 2017.
  • [33] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of sparql. ACM Transactions on Database Systems (TODS), 34(3):1–45, 2009.
  • [34] Ian Horrocks, Peter F Patel-Schneider, Sean Bechhofer, and Dmitry Tsarkov. Owl rules: A proposal and prototype implementation. Journal of web semantics, 3(1):23–40, 2005.
  • [35] Bob DuCharme. Learning SPARQL: querying and updating with SPARQL 1.1. O’Reilly Media, Inc., 2013.
  • [36] Youli Fang, Hong Wang, Lutong Wang, Ruitong Di, and Yongqiang Song. Diagnosis of copd based on a knowledge graph and integrated model. IEEE Access, 7:46004–46013, 2019.
  • [37] Duoyuan Ma, Yude Bai, Zhenchang Xing, Lintan Sun, and Xiaohong Li. A knowledge graph-based sensitive feature selection for android malware classification. 2020 27th Asia-Pacific Software Engineering Conference (APSEC), pages 188–197, 2020.
  • [38] Markian Jaworsky, Xiaohui Tao, Lei Pan, Shiva Raj Pokhrel, Jianming Yong, and Ji Zhang. Interrelated feature selection from health surveys using domain knowledge graph. Health Information Science and Systems, 11(1):54, 2023.
  • [39] Carol Pierannunzi, Shaohua Sean Hu, and Lina Balluz. A systematic review of publications assessing reliability and validity of the behavioral risk factor surveillance system (brfss), 2004–2011. BMC medical research methodology, 13(1):1–14, 2013.
  • [40] Mohamed Atef Mosa. Feature selection based on aco and knowledge graph for arabic text classification. Journal of Experimental & Theoretical Artificial Intelligence, pages 1–18, 2022.
  • [41] Marco Dorigo and Christian Blum. Ant colony optimization theory: A survey. Theoretical computer science, 344(2-3):243–278, 2005.
  • [42] Marco Dorigo, Mauro Birattari, and Thomas Stützle. Ant colony optimization. IEEE computational intelligence magazine, 1(4):28–39, 2006.
  • [43] Marco Dorigo and Thomas Stützle. Ant colony optimization: overview and recent advances. Springer, 2019.
  • [44] Rafael S Parpinelli, Heitor S Lopes, and Alex Alves Freitas. Data mining with an ant colony optimization algorithm. IEEE transactions on evolutionary computation, 6(4):321–332, 2002.
  • [45] David Martens, Manu De Backer, Raf Haesen, Jan Vanthienen, Monique Snoeck, and Bart Baesens. Classification with ant colony optimization. IEEE transactions on evolutionary computation, 11(5):651–665, 2007.
  • [46] Mehdi Hosseinzadeh Aghdam, Nasser Ghasem-Aghaee, and Mohammad Ehsan Basiri. Text feature selection using ant colony optimization. Expert systems with applications, 36(3):6843–6853, 2009.
  • [47] Aytuğ Onan. Srl-aco: A text augmentation framework based on semantic role labeling and ant colony optimization. Journal of King Saud University-Computer and Information Sciences, page 101611, 2023.
  • [48] Hyejung Chang. Making sense of the big picture: Data linkage and integration in the era of big data. Healthcare Informatics Research, 24(4):251, 2018.
  • [49] Li Li, Haolin Yang, Yueming Jiao, and Kuo-Yi Lin. Feature generation based on knowledge graph. IFAC-PapersOnLine, 53(5):774–779, 2020.
  • [50] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In international semantic web conference, pages 722–735. Springer, 2007.
  • [51] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014.
  • [52] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International conference on machine learning, pages 2071–2080. PMLR, 2016.
  • [53] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. Advances in neural information processing systems, 26, 2013.
  • [54] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
  • [55] Youmna Ismaeil, Daria Stepanova, Trung-Kien Tran, and Hendrik Blockeel. FeaBI: A Feature Selection-Based Framework for Interpreting KG Embeddings, pages 599–617. 2023.
  • [56] Hongwei Wang, Fuzheng Zhang, Jialin Wang, Miao Zhao, Wenjie Li, Xing Xie, and Minyi Guo. Ripplenet: Propagating user preferences on the knowledge graph for recommender systems. In Proceedings of the 27th ACM international conference on information and knowledge management, pages 417–426, 2018.
  • [57] Ajay Kumar, Shashank Sheshar Singh, Kuldeep Singh, and Bhaskar Biswas. Link prediction techniques, applications, and performance: A survey. Physica A: Statistical Mechanics and its Applications, 553:124289, 2020.
  • [58] Tareq Al-Moslmi, Marc Gallofré Ocaña, Andreas L Opdahl, and Csaba Veres. Named entity extraction for knowledge graphs: A literature overview. IEEE Access, 8:32862–32881, 2020.
  • [59] Zhe Chen, Yuehan Wang, Bin Zhao, Jing Cheng, Xin Zhao, and Zongtao Duan. Knowledge graph completion: A review. IEEE Access, 8:192435–192456, 2020.
  • [60] Shudong Wang, Zhenzhen Du, Mao Ding, Alfonso Rodriguez-Paton, and Tao Song. Kg-dti: a knowledge graph based deep learning method for drug-target interaction predictions and alzheimer’s disease drug repositions. Applied Intelligence, 52(1):846–857, 2022.
  • [61] YueQun Wang, LiYan Dong, YongLi Li, and Hao Zhang. Multitask feature learning approach for knowledge graph enhanced recommendations with ripplenet. Plos one, 16(5):e0251162, 2021.
  • [62] Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio, Antonio Ferrara, and Alberto Carlo Maria Mancino. Sparse feature factorization for recommender systems with knowledge graphs, pages 154–165. 2021.
  • [63] Guy Shani and Asela Gunawardana. Evaluating recommendation systems. Recommender systems handbook, pages 257–297, 2011.
  • [64] Zhanqiu Zhang, Jie Wang, Jiajun Chen, Shuiwang Ji, and Feng Wu. Cone: Cone embeddings for multi-hop reasoning over knowledge graphs. Advances in Neural Information Processing Systems, 34:19172–19183, 2021.
  • [65] Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
  • [66] Lior Rokach and Oded Maimon. Top-down induction of decision trees classifiers-a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476–487, 2005.
  • [67] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618, 2012.
  • [68] Joseph L Gastwirth. The estimation of the lorenz curve and gini index. The review of economics and statistics, pages 306–316, 1972.
  • [69] Pablo Castells, Neil Hurley, and Saul Vargas. Novelty and diversity in recommender systems. In Recommender systems handbook, pages 603–646. Springer, 2021.
  • [70] Vito Walter Anelli, Tommaso Di Noia, Eugenio Di Sciascio, Azzurra Ragone, and Joseph Trotta. How to make latent factors interpretable by feeding factorization machines with knowledge graphs. In The Semantic Web–ISWC 2019: 18th International Semantic Web Conference, Auckland, New Zealand, October 26–30, 2019, Proceedings, Part I 18, pages 38–56. Springer, 2019.
  • [71] Yehuda Koren. Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD), 4(1):1–24, 2010.
  • [72] Xiangnan He and Tat-Seng Chua. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 355–364, 2017.
  • [73] Gediminas Adomavicius and YoungOk Kwon. Improving aggregate recommendation diversity using ranking-based techniques. IEEE Transactions on Knowledge and Data Engineering, 24(5):896–911, 2011.
  • [74] Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. Managing popularity bias in recommender systems with personalized re-ranking. arXiv preprint arXiv:1901.07555, 2019.
  • [75] Ziwei Zhu, Jianling Wang, and James Caverlee. Measuring and mitigating item under-recommendation bias in personalized ranking systems. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval, pages 449–458, 2020.
  • [76] Xiaorui Su, Lun Hu, Zhuhong You, Pengwei Hu, and Bowei Zhao. Attention-based knowledge graph representation learning for predicting drug-drug interactions. Briefings in bioinformatics, 23(3):bbac140, 2022.
  • [77] Kanglin Hsieh, Yinyin Wang, Luyao Chen, Zhongming Zhao, Sean Savitz, Xiaoqian Jiang, Jing Tang, and Yejin Kim. Drug repurposing for covid-19 using graph neural network and harmonizing multiple evidence. Scientific reports, 11(1):23179, 2021.
  • [78] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI open, 1:57–81, 2020.
  • [79] Thomas N Kipf and Max Welling. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308, 2016.
  • [80] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
  • [81] Xiangxiang Zeng, Xiang Song, Tengfei Ma, Xiaoqin Pan, Yadi Zhou, Yuan Hou, Zheng Zhang, Kenli Li, George Karypis, and Feixiong Cheng. Repurpose open data to discover therapeutics for covid-19 using deep learning. Journal of proteome research, 19(11):4624–4636, 2020.
  • [82] Natasha T Strande, Erin Rooney Riggs, Adam H Buchanan, Ozge Ceyhan-Birsoy, Marina DiStefano, Selina S Dwight, Jenny Goldstein, Rajarshi Ghosh, Bryce A Seifert, Tam P Sneddon, et al. Evaluating the clinical validity of gene-disease associations: an evidence-based framework developed by the clinical genome resource. The American Journal of Human Genetics, 100(6):895–906, 2017.
  • [83] Jean-Baptiste Mouret and Jeff Clune. Illuminating search spaces by mapping elites. arXiv preprint arXiv:1504.04909, 2015.
  • [84] Yunpu Ma and Volker Tresp. Quantum machine learning algorithm for knowledge graphs. ACM Transactions on Quantum Computing, 2(3):1–28, 2021.
  • [85] Wei Huang, Jia Liu, Tianrui Li, Shenggong Ji, Dexian Wang, and Tianqiang Huang. Fedcke: Cross-domain knowledge graph embedding in federated learning. IEEE Transactions on Big Data, 2022.

Sisi Shao is currently a PhD student at the Department of Biostatistics, UCLA Fielding School of Public Health. She received an M.Sc. degree in Financial Engineering from UCLA Anderson School of Management. Her research interests include artificial intelligence, automated machine learning methods, and statistical methods for large-scale multivariate time series and longitudinal data.


Pedro Henrique Ribeiro received an M.S.E. in Bioengineering from the University of Pennsylvania and a B.A. in Computer Science from Oberlin College. He is currently a research data scientist at the Cedars-Sinai Department of Computational Biomedicine. His main research interests are in machine learning and evolutionary algorithms.


Christina Ramirez received her Ph.D. degree in Statistics/Social Science from the California Institute of Technology, Pasadena, CA, USA, in 1999. She is currently a Professor of Biostatistics at the UCLA Fielding School of Public Health. Her main research interests include HIV pathogenesis, HIV drug resistance mutation/recombination, viral fitness, coreceptor utilization, and high-dimensional data analysis.


Jason H. Moore is Chair of the Department of Computational Biomedicine at Cedars-Sinai Medical Center, where he also serves as Director of the Center for Artificial Intelligence Research and Education. His research focuses on the development and application of artificial intelligence and machine learning methods for the analysis of biomedical and clinical data, with the goal of improving health.