
After the knowledge graph: ontology-grounded LLMs, GraphRAG, and statistical discovery

27 May 2026 Damien Huzard, PhD

A knowledge graph is not an endpoint. Its scientific value begins when it becomes traceable, semantically constrained, queryable, computationally exploitable, and evaluated — and when LLMs, embeddings, and human reviewers each operate inside the layer that actually fits them.

1. Four objects, often conflated

A previous note argued for structure first and generation last: scientific LLM workflows need metadata, ontologies, and graphs rather than longer context windows [16]. Once such a graph exists, a different question opens: what to do with it. The first step is terminological, because four objects are routinely conflated.

A graph database is a storage and query engine: Neo4j, Memgraph, Kùzu, and Apache AGE on the property-graph side, and Apache Jena Fuseki or Oxigraph on the RDF side.
A knowledge graph is a graph whose nodes and edges represent identifiable entities and assertions about a domain, ideally with provenance and controlled semantics.
An ontology is the formal semantic specification of classes, relations, hierarchies, and constraints, typically expressed in RDF, OWL, or SKOS, with SHACL for executable validation.
A GraphRAG index is an application-oriented retrieval structure built from a corpus to support LLM question answering, containing entities, relations, vector indexes, and community summaries [1]. It is not automatically a validated ontology.

This distinction matters because an LLM can rapidly produce a graph, but the resulting graph may contain duplicated entities, invalid relation types, unsupported assertions, or misleading summaries. The scale issue is concrete: a 2025 Nature Communications study reported more than ten million LLM-extracted mental-health relations, including many novel associations not present in existing resources — at that scale, provenance and validation are not optional [15]. DR.KNOWS adds the path-quality lesson: KG paths can improve diagnostic prediction with LLMs, but irrelevant or contradictory paths impair model performance [2]. Treat LLM-generated graphs as candidate-proposal artefacts, not as canonical knowledge [3].

2. Separate evidence, ontology, and application into three layers

A robust implementation separates the canonical evidence layer (what is directly supported by source documents and datasets, with passage-level provenance), the ontological layer (what concepts and relations mean, validated by SHACL), and the application/model layer (GraphRAG summaries, vector embeddings, link predictions used for search and inference). The canonical graph should store only accepted or explicitly status-labelled assertions; LLM-generated relations, GraphRAG summaries, and predicted links should remain distinguishable through a status field (candidate | validated | rejected | disputed) and an extraction-method field [3].

A minimal assertion model that supports this discipline includes subject, predicate, object, source document (DOI/PMID/internal ID), source passage, extraction method, model version, confidence, validation status, curator, timestamp, and an evidence grade. This is the same provenance pattern Neuronautix already recommends through RO-Crate, FAIRSCAPE, and BioCompute Objects for preclinical and NAM data — the KG simply inherits it.
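
The assertion model above can be sketched as a small immutable record. This is an illustrative sketch rather than a reference schema: the field names simply mirror the list in the text, while the `accepted_only` helper, the identifiers, the model name, and the evidence grade "B" are placeholders introduced for the example. The filter keeps only validated assertions, which is one possible reading of "accepted"; a graph that also stores status-labelled candidates would filter at query time instead.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Status(Enum):
    CANDIDATE = "candidate"
    VALIDATED = "validated"
    REJECTED = "rejected"
    DISPUTED = "disputed"


@dataclass(frozen=True)
class Assertion:
    subject: str
    predicate: str
    obj: str
    source_document: str            # DOI, PMID, or internal ID
    source_passage: str
    extraction_method: str          # e.g. "llm", "manual", "link_prediction"
    model_version: Optional[str]
    confidence: float
    status: Status
    curator: Optional[str]
    timestamp: datetime
    evidence_grade: Optional[str]


def accepted_only(assertions):
    """Return assertions that are validated and carry a named curator."""
    return [a for a in assertions
            if a.status is Status.VALIDATED and a.curator is not None]


# Placeholder values throughout; none of these identifiers are real.
candidate = Assertion(
    subject="ex:cage_change",
    predicate="ex:influences",
    obj="ex:locomotor_activity",
    source_document="internal:DOC-0001",
    source_passage="(quoted source passage)",
    extraction_method="llm",
    model_version="hypothetical-model-v1",
    confidence=0.71,
    status=Status.CANDIDATE,
    curator=None,
    timestamp=datetime(2026, 5, 27, tzinfo=timezone.utc),
    evidence_grade=None,
)

# Curation produces a new record instead of mutating the candidate in place.
validated = replace(candidate, status=Status.VALIDATED,
                    curator="curator-01", evidence_grade="B")

accepted = accepted_only([candidate, validated])
```

Freezing the dataclass means a status change always yields a new record, so the original LLM proposal and the curated version can be stored side by side and audited later.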

3. GraphRAG, calibrated

Microsoft GraphRAG extracts entities and relations from documents, builds a graph, detects communities, and generates community summaries to support reasoning across distributed information; in its original evaluation it improved comprehensiveness and diversity for corpus-wide sensemaking questions compared with naïve vector RAG [4]. LightRAG adds dual-level graph plus vector retrieval with incremental updates [5]. HippoRAG models long-term memory with Personalized PageRank over a KG-like index for multi-hop associative retrieval [6]. OG-RAG retrieves through ontology-grounded structures rather than text chunks and reports improved correctness across several LLMs [7].
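
To make the Personalized PageRank idea concrete, here is a minimal pure-Python power iteration over a toy entity graph. It is not HippoRAG's implementation; the graph, the node names, the damping factor, and the iteration count are assumptions chosen for illustration. The only point is that the restart mass returns to the query's seed entities, so nodes near the seeds score higher than distant ones.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration in which the restart mass returns to the seed nodes only."""
    nodes = sorted({n for edge in edges for n in edge})
    missing = set(seeds) - set(nodes)
    if not seeds or missing:
        raise ValueError(f"seeds must be non-empty graph nodes, missing: {missing}")

    neighbours = {n: [] for n in nodes}
    for a, b in edges:                      # the toy graph is treated as undirected
        neighbours[a].append(b)
        neighbours[b].append(a)

    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            share = damping * rank[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        rank = nxt
    return rank


# Hypothetical entity chain; the query mentions "cage_density".
edges = [
    ("cage_density", "home_cage_activity"),
    ("home_cage_activity", "circadian_phase"),
    ("circadian_phase", "light_cycle"),
    ("light_cycle", "bedding_type"),
]
scores = personalized_pagerank(edges, seeds={"cage_density"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

In a retrieval setting the node scores would then be aggregated onto the passages that mention each entity, which is the step that turns graph proximity into a ranked evidence list.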

The calibration matters. Han et al. report that conventional RAG and GraphRAG are complementary: vanilla RAG tends to win for single-hop factual retrieval, while GraphRAG is more useful for multi-hop reasoning and broader contextual synthesis [8]. A more recent benchmark-oriented analysis finds that graph-based retrieval may underperform vanilla RAG on some real-world tasks and that graph use should be empirically justified rather than assumed beneficial [9]. The practical conclusion is direct: when a validated KG already exists, do not let GraphRAG recreate the canonical graph from scratch — use it as a retrieval and summarisation layer over validated entities, relations, and evidence.

4. The right database depends on the question, not the trend

There is no universally correct graph database. The appropriate choice depends on whether the graph is primarily an interoperable semantic resource, an application retrieval engine, an embedded analytical component, or a real-time operational system. Neo4j with neo4j-graphrag-python is mature for interactive applications, hybrid retrieval, and graph-assisted LLM systems; the neosemantics plugin adds RDF import/export and SHACL validation while underlying storage remains property-graph oriented [14]. Apache Jena Fuseki and Oxigraph are standards-native for RDF/SPARQL and appropriate for canonical scientific ontologies and FAIR linked-data layers.

For a formal scientific ontology or metadata platform — the pattern we apply to NAMO, HCMO, and MBO work — the canonical representation belongs in RDF/JSON-LD with OWL/SKOS semantics and SHACL validation, stored in Jena Fuseki or Oxigraph, processed with RDFLib and pySHACL, and projected into Neo4j for graphical navigation and GraphRAG experiments. For discovery-oriented technical corpora, the inverse is reasonable: Neo4j as the principal operational graph, GraphRAG and LightRAG tested against a vector baseline, with a minimal ontology from day one and RDF/SHACL export added once the schema stabilises.
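
The projection step from a canonical RDF representation into a property graph can be illustrated without any graph library. The sketch below is a simplification under stated assumptions: triples are plain tuples, the `LiteralValue` wrapper and `project_to_property_graph` function are hypothetical names, and a repeated literal predicate overwrites the earlier value. In practice a dedicated tool such as neosemantics performs a more complete mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiteralValue:
    """Marks a triple object as a data value and not as another resource."""
    value: object


def project_to_property_graph(triples):
    """Literals become node properties; resource objects become typed edges."""
    nodes, edges = {}, []
    for subject, predicate, obj in triples:
        nodes.setdefault(subject, {})
        if isinstance(obj, LiteralValue):
            nodes[subject][predicate] = obj.value   # last value wins in this sketch
        else:
            nodes.setdefault(obj, {})
            edges.append((subject, predicate, obj))
    return nodes, edges


# Placeholder triples using an invented "ex:" namespace.
triples = [
    ("ex:study_1", "rdf:type", "ex:Study"),
    ("ex:study_1", "ex:usesSystem", "ex:system_1"),
    ("ex:study_1", "ex:title", LiteralValue("Placeholder title")),
]
nodes, edges = project_to_property_graph(triples)
```

Keeping the projection one-directional, from the RDF layer to the property graph, preserves the RDF store as the single canonical source while the property graph stays a disposable view for navigation and GraphRAG experiments.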

5. Using the ontology with LLMs — and with statistics

An ontology can constrain an LLM at extraction time by defining admissible classes, permitted relation types, mandatory provenance fields, cardinality rules, synonym mappings, and forbidden ambiguous relations. This converts the LLM from an autonomous graph generator into a candidate-proposal engine operating under semantic constraints. At retrieval time, the ontology can replace embedding-only similarity with graph expansion: a query about housing variables that could bias activity-based control-group comparisons resolves through typed paths from ActivityEndpoint back to MonitoringSystem, EnvironmentalFactor, and SourcePublication, producing a restricted, traceable evidence package rather than a bag of similar text fragments [7].
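
Both uses of the ontology described above, gating candidates at extraction time and expanding along typed paths at retrieval time, can be sketched with one shared relation table. The class names come from the example in the text; the predicate names (`measuredBy`, `affectedBy`, `reportedIn`), the entity identifiers, and both function names are hypothetical. A production setup would express these rules as SHACL shapes; this is only a minimal stand-in.

```python
# Class names come from the text above; predicate names are hypothetical.
RELATION_SIGNATURES = {
    # predicate: (required subject class, required object class)
    "measuredBy": ("ActivityEndpoint", "MonitoringSystem"),
    "affectedBy": ("ActivityEndpoint", "EnvironmentalFactor"),
    "reportedIn": ("ActivityEndpoint", "SourcePublication"),
}
REQUIRED_PROVENANCE = ("source_document", "source_passage")


def check_candidate(candidate, entity_class):
    """Return a list of violations; an empty list means the candidate can go to review."""
    violations = []
    signature = RELATION_SIGNATURES.get(candidate["predicate"])
    if signature is None:
        violations.append("predicate not in ontology")
    else:
        domain, range_ = signature
        if entity_class.get(candidate["subject"]) != domain:
            violations.append("subject class violates domain")
        if entity_class.get(candidate["object"]) != range_:
            violations.append("object class violates range")
    for field in REQUIRED_PROVENANCE:
        if not candidate.get(field):
            violations.append(f"missing {field}")
    return violations


def evidence_package(endpoint, validated_edges):
    """Collect the typed one-hop neighbourhood of an endpoint, grouped by predicate."""
    package = {predicate: [] for predicate in RELATION_SIGNATURES}
    for subject, predicate, obj in validated_edges:
        if subject == endpoint and predicate in package:
            package[predicate].append(obj)
    return package


entity_class = {
    "ex:distance_travelled": "ActivityEndpoint",
    "ex:home_cage_sensor": "MonitoringSystem",
    "ex:cage_change": "EnvironmentalFactor",
    "ex:paper_1": "SourcePublication",
}
good = {"subject": "ex:distance_travelled", "predicate": "measuredBy",
        "object": "ex:home_cage_sensor",
        "source_document": "internal:DOC-0001",
        "source_passage": "(quoted source passage)"}
bad = {"subject": "ex:distance_travelled", "predicate": "causes",
       "object": "ex:cage_change", "source_document": "", "source_passage": ""}

good_violations = check_candidate(good, entity_class)
bad_violations = check_candidate(bad, entity_class)

validated_edges = [
    ("ex:distance_travelled", "measuredBy", "ex:home_cage_sensor"),
    ("ex:distance_travelled", "affectedBy", "ex:cage_change"),
    ("ex:distance_travelled", "reportedIn", "ex:paper_1"),
    ("ex:rearing_count", "measuredBy", "ex:home_cage_sensor"),
]
package = evidence_package("ex:distance_travelled", validated_edges)
```

The rejected candidate is not discarded silently: the violation list is itself useful curator feedback, and it is the raw material for the violation-rate metrics a benchmark would track.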

LLMs are not the only — or always the best — way to exploit a KG. Knowledge graph embeddings (TransE, DistMult, ComplEx, RotatE; via PyKEEN or DGL-KE [13]) and graph neural networks support link prediction, clustering, anomaly detection, entity resolution, and hypothesis prioritisation. A predicted missing edge is a ranked hypothesis, not a validated fact — its proper consumer is a literature review or a follow-up experiment, not the canonical evidence graph. FuseLinker, published in the Journal of Biomedical Informatics in 2024, shows that combining LLM-derived textual representations with graph neural networks and link prediction outperforms either modality alone for biomedical KG completion [10]. LORE applies the same discipline at the assertion level: literature-derived disease–gene information is represented as verifiable factual statements before semantic representations drive prediction [11]. For ontology evolution, the robust pattern is candidate mapping from the LLM, structural filtering by the ontology, embedding-based ranking, and human validation of high-impact mappings — supported by tools such as DeepOnto, LLMs4OM, OWL2Vec-Star, and OntoAligner [12].
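
The statement that a predicted edge is a ranked hypothesis can be illustrated with the TransE scoring function, which rates a triple by how closely head plus relation lands on the tail in embedding space. The two-dimensional vectors below are hand-set for illustration, not trained, and the entity names, the `rank_tails` helper, and the output fields are assumptions. A real experiment would train embeddings with a toolkit such as PyKEEN.

```python
import math


def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def rank_tails(head, relation, candidates, known):
    """Score every tail that is not already asserted and sort best first."""
    scored = [(name, transe_score(head, relation, vector))
              for name, vector in candidates.items() if name not in known]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hand-set toy embeddings (hypothetical, not learned from data).
gene_a = (0.0, 0.0)
associated_with = (1.0, 0.0)
diseases = {
    "disease_X": (1.0, 0.1),
    "disease_Y": (0.2, 0.9),
    "disease_Z": (1.1, -0.1),
}
known_tails = {"disease_X"}          # already in the evidence graph

ranking = rank_tails(gene_a, associated_with, diseases, known_tails)

# Predictions leave the model as labelled candidates, never as validated facts.
hypotheses = [
    {"subject": "gene_A", "predicate": "associated_with", "object": name,
     "score": score, "status": "candidate", "extraction_method": "link_prediction"}
    for name, score in ranking
]
```

Tagging every prediction with its status and extraction method at the moment it is produced is what keeps the application layer from leaking into the canonical graph.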

6. Four experiments before any custom model

The best next step after KG construction is generally not a custom embedding architecture or a fine-tuned model — it is a benchmark. Four experiments isolate the questions that matter:

Does the graph improve retrieval? Compare a vector RAG baseline, ontology-filtered vector retrieval, pure Cypher or SPARQL graph traversal, and hybrid GraphRAG on 50–100 expert-authored competency questions. Measure answer correctness, citation precision and recall, unsupported-statement rate, retrieval recall at k, latency, and token cost — separated by single-hop versus multi-hop questions [8][9].
Does ontology-constrained extraction improve KG quality? Compare unconstrained LLM triple extraction with ontology-constrained extraction plus SHACL validation. Evaluate a manually reviewed sample for entity normalisation accuracy, relation precision, unsupported assertion frequency, SHACL violation rate, and curator correction burden. The success criterion is not the number of extracted relations — it is the proportion that are scientifically interpretable, traceable, and reusable.
Do graph embeddings add value beyond text embeddings? Compare text-only baselines against KGE-only, fused text plus KGE, and ontology-aware fused models. Use a temporal split — hold out later publications — whenever the objective resembles scientific discovery; a random split lets future information leak into model development and exaggerates real-world performance [10].
Can the ontology evolve safely? Evaluate DeepOnto, LLMs4OM, or OntoAligner on curated known mappings plus deliberately difficult negative cases. For ontology maintenance, top-rank precision matters more than recall because one invalid mapping contaminates downstream queries [12].
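
Two pieces of the protocol above are easy to get wrong and short enough to sketch: the temporal split and the retrieval metric reported separately for single-hop and multi-hop questions. The record layout, the function names, and all identifiers below are assumptions made for illustration. Dropping test triples already present in the training period is an added design choice in this sketch, on the reasoning that a restated assertion is not a discovery.

```python
from collections import defaultdict


def temporal_split(assertions, cutoff_year):
    """Train on assertions up to cutoff_year; test on later, previously unseen triples."""
    train = [a for a in assertions if a["year"] <= cutoff_year]
    seen = {a["triple"] for a in train}
    test = [a for a in assertions
            if a["year"] > cutoff_year and a["triple"] not in seen]
    return train, test


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear among the top k retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each question needs at least one relevant item")
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def mean_by_hop(results):
    """Average recall separately for each question type."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result["hops"]].append(result["recall"])
    return {hops: sum(values) / len(values) for hops, values in grouped.items()}


# Placeholder assertions; years and identifiers are invented.
assertions = [
    {"triple": ("g1", "associated_with", "d1"), "year": 2021},
    {"triple": ("g2", "associated_with", "d1"), "year": 2023},
    {"triple": ("g1", "associated_with", "d1"), "year": 2025},  # restated, not new
    {"triple": ("g3", "associated_with", "d2"), "year": 2025},
]
train, test = temporal_split(assertions, cutoff_year=2023)

results = [
    {"hops": "single", "recall": recall_at_k(["p1", "p9", "p4"], {"p1"}, k=2)},
    {"hops": "multi", "recall": recall_at_k(["p3", "p7", "p2"], {"p2", "p3"}, k=2)},
    {"hops": "multi", "recall": recall_at_k(["p5", "p6", "p8"], {"p5", "p6"}, k=2)},
]
summary = mean_by_hop(results)
```

Reporting the two question types separately matters because a single pooled average can hide exactly the single-hop versus multi-hop trade-off that the comparison is meant to expose.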

What this means for Neuronautix work

For the HCMO, MBO, and NAMO ontologies, and the FAIR preclinical data exchange they support, the priority order is ontology and evidence quality first, interoperable semantic representation second, validated retrieval third, and GraphRAG plus predictive modelling fourth. This is the architecture being co-developed with LIRMM under the ontology-constrained LLM navigation workstream, which uses domain ontologies to constrain LLM agents traversing preclinical knowledge graphs, with the aims of increasing efficacy, reducing hallucination, and enabling provenance-aware question answering. Embeddings and link prediction live at the application layer and produce ranked hypotheses; they do not silently modify canonical knowledge. Each tool has its place: GraphRAG for synthesis and multi-hop exploration, ontology-aware LLM workflows for extraction and mapping, and statistical graph models for prediction. None of these should blur the distinction between sourced evidence, curated knowledge, and model-generated hypotheses [3].

References

[1] Microsoft GraphRAG — Reference implementation for entity-based corpus retrieval with community summaries and local/global/DRIFT search.
[2] DR.KNOWS: Diagnosis Prediction With Knowledge Graph and Large Language Models — Gao et al., JMIR AI 2025 (PMID 39993309). UMLS-derived KG paths can improve diagnostic prediction with LLMs but path quality is a critical bottleneck.
[3] Unifying Large Language Models and Knowledge Graphs: A Roadmap — Pan et al., IEEE TKDE 2024. KG-enhanced LLMs, LLM-augmented KGs, and integrated systems require explicit integration and validation rather than treating LLM outputs as ground truth.
[4] From Local to Global: A Graph RAG Approach to Query-Focused Summarization — Edge et al., arXiv:2404.16130, 2024. Entity graph plus community summaries improve comprehensiveness and diversity for corpus-wide sensemaking.
[5] LightRAG: Simple and Fast Retrieval-Augmented Generation — Guo et al., arXiv:2410.05779, 2024 (preprint). Dual-level graph and vector retrieval with incremental updates.
[6] HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models — Jiménez Gutiérrez et al., arXiv:2405.14831, 2024 (preprint). KG-like memory with Personalized PageRank for multi-hop associative retrieval.
[7] OG-RAG: Ontology-Grounded Retrieval-Augmented Generation For Large Language Models — Sharma et al., arXiv:2412.15235, 2024 (preprint). Ontology-grounded retrieval delivers a minimal, conceptually grounded context with improved correctness in the evaluated tasks.
[8] RAG vs. GraphRAG: A Systematic Evaluation and Key Insights — Han et al., arXiv:2502.11371, 2025 (preprint). Conventional RAG and GraphRAG are complementary: vanilla RAG tends to perform better for single-hop retrieval, GraphRAG for multi-hop reasoning and contextual synthesis.
[9] When to Use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation — Xiang et al., arXiv:2506.05690, 2025/26 (preprint). Graph-based retrieval may underperform vanilla RAG on some real-world tasks; graph use should be empirically justified.
[10] FuseLinker: Large language model-enhanced biomedical knowledge graph completion through text and structural representation fusion — Xiao et al., Journal of Biomedical Informatics 2024 (PMID 39326691). Text-plus-graph fusion outperforms either modality alone for biomedical KG completion.
[11] A large language model framework for literature-based disease–gene association prediction (LORE) — Li et al., Briefings in Bioinformatics 2025 (PMID 39998433). Verifiable factual statements as the substrate for semantic prediction.
[12] LLMs4OM: Matching Ontologies with Large Language Models — Giglou et al., arXiv:2404.10317, 2024 (preprint). LLM-based ontology matching using concept labels with parent/child context; companion line of work includes DeepOnto, OWL2Vec-Star, and OntoAligner.
[13] PyKEEN — Reproducible Python toolkit for knowledge graph embedding experiments (TransE, ComplEx, RotatE, GNNs). Practical reference for KGE benchmarking.
[14] Neo4j GraphRAG Python — First-party Neo4j GraphRAG library covering graph construction and hybrid retrieval. Companion: neosemantics for RDF import/export and SHACL validation inside Neo4j.
[15] Large language model powered knowledge graph construction for mental health exploration — Gao et al., Nature Communications 2025. LLM-assisted KG construction at scale (>10M relations); makes provenance and validation indispensable.
[16] Why scientific LLM workflows need metadata, ontologies, and graphs — not just longer context — Neuronautix, 18 May 2026. Companion note: structure first, generate last. This note is the operational sequel.
