Introduction
The core goal of my SSI Fellowship is to facilitate knowledge exchange of skills and competencies between humanities and sciences. This includes improving my own knowledge and skills relating to the latest AI techniques and their relevance for teaching and learning.
In November 2025, I attended the ELIXIR BioHackathon Europe event at Bad Saarow near Berlin, Germany. The BioHackathon runs every year thanks to ELIXIR Europe and provides space for collaboration and innovation in computational biology and bioinformatics. Participants engage in intensive, hands-on programming and content-creation activities, data integration and software development. I joined the project “Knowledge graphs for metadata on training”, led by Geert van Geest (SIB Swiss Institute of Bioinformatics), Harshita Gupta (SciLifeLab Sweden), and Vincent Emonet (SIB).
Outputs, report and user stories
The project has published a report of its outcomes on BioHackrXiv [precise link and citation to follow]. Highlights of the report include:
- A description of what a knowledge graph is, and how this structured data representation can greatly facilitate complex querying and applications to deep learning approaches like generative AI.
- A description of the training registry metadata sources we used (TeSS and Galaxy Training Network) and the most important metadata fields.
- A description of the code written to build the knowledge graph, and of the Model Context Protocol (MCP) server we developed, which exposes a suite of tools for searching and querying training materials.
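To make the idea of a knowledge graph more concrete, here is a minimal sketch in plain Python. It is not the code from our project: the identifiers, predicate names and training materials are all invented for illustration. Each fact is stored as a (subject, predicate, object) triple, and a query follows the links between them.

```python
# Toy knowledge graph: each fact is a (subject, predicate, object) triple.
# All identifiers, predicates and titles are made up for illustration.
TRIPLES = [
    ("material:1", "title", "Intro to machine learning"),
    ("material:1", "topic", "topic:machine-learning"),
    ("material:1", "listedIn", "registry:TeSS"),
    ("material:2", "title", "Deep learning with Galaxy"),
    ("material:2", "topic", "topic:machine-learning"),
    ("material:2", "listedIn", "registry:GTN"),
    ("material:3", "title", "RNA-seq basics"),
    ("material:3", "topic", "topic:transcriptomics"),
    ("material:3", "listedIn", "registry:GTN"),
]


def match(subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in TRIPLES
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def titles_on_topic(topic):
    """Find materials on a topic, then look up each title and registry."""
    results = []
    for material, _, _ in match(predicate="topic", obj=topic):
        title = match(subject=material, predicate="title")[0][2]
        registry = match(subject=material, predicate="listedIn")[0][2]
        results.append((title, registry))
    return sorted(results)


if __name__ == "__main__":
    for title, registry in titles_on_topic("topic:machine-learning"):
        print(f"{title} ({registry})")
```

Because every material points at a shared topic node, one query can gather results from both registries without knowing in advance where each material is listed.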
Several projects at the BioHackathon featured Model Context Protocol (MCP), and this was a new concept for me. MCP is an open standard introduced by Anthropic (the company responsible for Claude); think of it as a more powerful kind of API. An MCP server gives AI agents a consistent way to connect with data, tools and services, which can help reduce hallucination because answers draw on a defined source. Agents can perform multi-step tasks, reducing the effort for the user, who interacts via a natural language interface. In our case, this means we can offer an LLM chatbot that queries multiple training registries at once and answers much deeper questions.
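The sketch below illustrates the general idea of exposing a named, described tool that an agent can call. It is plain Python and deliberately does not use the MCP SDK or the real protocol messages; the registry records, the `tool` decorator, `search_training` and `call_tool` are all hypothetical names invented for this example.

```python
# In-memory stand-ins for two training registries; the records are invented.
REGISTRIES = {
    "TeSS": [
        {"title": "Intro to machine learning", "keywords": ["ai", "python"]},
    ],
    "GTN": [
        {"title": "Deep learning with Galaxy", "keywords": ["ai", "galaxy"]},
        {"title": "RNA-seq basics", "keywords": ["transcriptomics"]},
    ],
}

# Registry of tools the (hypothetical) server offers to an agent.
TOOLS = {}


def tool(description):
    """Register a function as a tool, with a description an agent can read."""
    def register(func):
        TOOLS[func.__name__] = {"description": description, "run": func}
        return func
    return register


@tool("Search all training registries for materials matching a keyword.")
def search_training(keyword):
    hits = []
    for registry, records in REGISTRIES.items():
        for record in records:
            if keyword.lower() in record["keywords"]:
                hits.append({"registry": registry, "title": record["title"]})
    return hits


def call_tool(name, arguments):
    """Look up a tool by name and run it with the given arguments."""
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name]["run"](**arguments)


if __name__ == "__main__":
    for hit in call_tool("search_training", {"keyword": "AI"}):
        print(hit)
```

The point of the design is that the agent only needs the tool's name and description; which registries sit behind it, and how they are queried, stays on the server side.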
When I applied to this project, I had intended to work on the Python code directly. In the end, I changed my focus to provide project and community management. My most significant contribution to the project was to lead the development of 11 user stories. These user stories are a representative selection of the kinds of queries a user might make of the system, following a popular format:
- As a [user persona]
- I want to [do task]
- …so that [outcome/benefit]
An example user story we created is:
- As a bioinformatics scientist
- I want to define a learning path of training materials and/or events
- …so that I can become a specialist in artificial intelligence within the following specified time and resources: I have 6 months, a workload of 14 days, I live in Sweden and I can travel within Europe once.
Having a range of user stories ready to refer to throughout the project helps the developers to keep the requirements in mind when coding and allows them to focus on the software development.
Impact and opportunities
The project successfully created a proof of concept for the ELIXIR ecosystem and beyond. MCP shows promise for scientific applications: used in combination with an LLM, it allows natural language querying while drawing on information from a trusted resource.
Participating in the BioHackathon gave me a better appreciation of current AI applications and features, in turn supporting me in my core work and with other aspects of my SSI Fellowship. I learned more about the importance of having persistent, unique identifiers for all nodes in a knowledge graph, and how valuable it is for the concepts described within the source materials to carry an ID of their own. The knowledge graph we built had gaps which had to be filled afterwards, and it would be more accurate if the training providers had already filled these gaps before we scraped the training materials. We already use the URL of a training course or event as the unique identifier for the material as a whole, but what about the trainers, the hosts and the locations of events? There are recommended authorities to consider, such as ORCID for persons, ROR for organisations, and OpenStreetMap for places. I can now advocate for training registries and providers to host additional metadata for these inner entities within a given training material.
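As a rough illustration of what a registry could check, the sketch below reports which inner entities of a training record lack an identifier. The record layout, the field names and the `missing_identifiers` helper are assumptions made up for this example, and the ID values are placeholders rather than real identifiers.

```python
# Hypothetical metadata record for one training event.
# The "id" values are placeholders, not real identifiers.
EVENT = {
    "id": "https://example.org/events/intro-to-ai",
    "trainers": [
        {"name": "Trainer A", "id": "https://orcid.org/<placeholder>"},
        {"name": "Trainer B", "id": None},
    ],
    "host": {"name": "Example Institute", "id": None},
    "location": {
        "name": "Example City",
        "id": "https://www.openstreetmap.org/<placeholder>",
    },
}


def missing_identifiers(record):
    """List the inner entities of a record that have no persistent ID."""
    missing = []
    for trainer in record.get("trainers", []):
        if not trainer.get("id"):
            missing.append(f"trainer: {trainer['name']}")
    for role in ("host", "location"):
        entity = record.get(role)
        if entity is not None and not entity.get("id"):
            missing.append(f"{role}: {entity['name']}")
    return missing


if __name__ == "__main__":
    for gap in missing_identifiers(EVENT):
        print("No identifier for", gap)
```

A report like this could be shared with training providers so that gaps are filled at the source rather than patched after the knowledge graph is built.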
Thank you to the project leaders Geert van Geest, Harshita Gupta and Vincent Emonet, and to all the other project members: Finn Bacall, Jerven Bolleman, Jacobo Miranda and Dimitris Panouris. Thank you also to the people who kindly helped answer my questions during the hackathon: Alex Botzki, Eli Chadwick, Alban Gaignard, Helena Schnitzer, Ginger Tseung, Bérénice Batut and Carole Goble.