Project 2

Adapting SIRIUS and beyond for Electron Ionization fragmentation

Prof. Dr. Sebastian Böcker (main supervisor)
Bioinformatics, Faculty of Mathematics and Computer Science, FSU Jena
Prof. Dr. Georg Pohnert (co-supervisor)
Instrumental Analytics, Institute for Inorganic and Analytical Chemistry, Faculty of Chemistry and Earth Sciences, FSU Jena

Mass spectrometry (MS) is the analytical platform of choice for high-throughput screening of small molecules. MS is typically used in combination with a chromatographic separation technology; gas chromatography (GC-MS) is arguably still the best separation tool for compounds amenable to the technique. Electron (impact) Ionization (EI) simultaneously ionizes and fragments the molecules; the resulting spectra are fragment-rich but often show a low-intensity or missing molecular ion peak, meaning that the mass of the compound is often unknown. Recently, technically mature GC-MS instruments with high mass accuracy have become available, making de novo interpretation of EI fragmentation data possible.

The Böcker group is one of the leading research groups developing computational methods for untargeted metabolomics. Numerous scientific approaches for this task were developed in our lab during the last decade, including CSI:FingerID for searching in molecular structure databases (PNAS, 2015), SIRIUS for molecular formula annotation and processing of full datasets (Nat Methods, 2019), CANOPUS for comprehensive compound class assignment (Nat Biotechnol, 2021), and COSMIC for assigning confidence in annotations (Nat Biotechnol, 2022). We have won numerous CASMI challenges on the topic, and our web services for small molecule annotation have processed about half a billion queries.

Project description:
With the advent of high mass accuracy GC-MS instrumentation, it becomes possible to adapt our computational tools for GC-MS data. GC-MS and EI fragmentation differ in many details from LC-MS and tandem MS, and several subproblems must be addressed; for example:

  1. EI mass spectra are often missing the molecular ion peak, and the mass and/or molecular formula of the compound has to be reconstructed from the fragments using Machine Learning and combinatorics.
  2. EI mass spectra contain isotope patterns, which can be used to improve fragmentation tree quality. Unfortunately, radical losses H and H3 often interfere with the interpretation of the isotope patterns.
  3. Available reference data for high mass accuracy GC-MS is insufficient to train Machine Learning methods. To bypass this, we want to “lift” low mass accuracy spectra and add them to the training data.
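
To illustrate why hydrogen losses complicate isotope pattern interpretation (subproblem 2), here is a minimal Python sketch. It rests on an assumption that is not spelled out in the project description: that the interference arises because the 13C isotope peak of a fragment that has lost hydrogen falls at almost the same mass as the monoisotopic peak of the intact fragment. The function names are hypothetical and are not part of SIRIUS; the isotope masses are standard values, rounded.

```python
# Monoisotopic masses in Da (standard values, rounded to 8 decimals).
MASS_H = 1.00782503
MASS_C12 = 12.0
MASS_C13 = 13.00335484

# Spacing between the 12C and 13C isotopologues of the same ion.
C13_SHIFT = MASS_C13 - MASS_C12


def interference_gap(n_hydrogens_lost: int = 1) -> float:
    """Mass gap (Da) between an ion F and the n-fold 13C isotope peak of (F - nH).

    Assumption for this sketch: both ions are singly charged, so the
    electron mass cancels and only H versus 13C-12C differences remain.
    """
    return n_hydrogens_lost * (MASS_H - C13_SHIFT)


def required_resolving_power(mz: float, n_hydrogens_lost: int = 1) -> float:
    """Approximate resolving power m / delta_m needed to separate the two peaks."""
    return mz / interference_gap(n_hydrogens_lost)


if __name__ == "__main__":
    for mz in (100.0, 200.0, 400.0):
        gap_mda = interference_gap(1) * 1000.0
        power = required_resolving_power(mz)
        print(f"m/z {mz:6.1f}: gap {gap_mda:.2f} mDa, resolving power ~{power:,.0f}")
```

Under this assumption the two peaks are only a few millidalton apart, so they merge at unit mass resolution and can only be told apart on instruments with high mass accuracy and resolving power.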

We will also promptly apply the developed methods to biological data. The project will be conducted in close collaboration with experimental research groups around the globe, in particular that of Prof. Georg Pohnert.

Candidate profile:

  • M.Sc. in bioinformatics, cheminformatics, computer science, or mathematics
  • Expertise and interest in algorithmics and bioinformatics methods development
  • Experience in biochemistry desirable
  • Expertise in Machine Learning highly desirable
  • Experience in software development (Git, Artifactory) is a must
  • Experience in Java, Python and ML frameworks is desirable
  • Ability to interact with coworkers, collaboration partners and software users


  1. M. A. Stravs, K. Dührkop, S. Böcker, and N. Zamboni. MSNovelist: de novo structure generation from mass spectra. Nat Methods, 2022.
  2. M. A. Hoffmann, …, S. Böcker. High-confidence structural annotation of metabolites absent from spectral libraries. Nat Biotechnol, 40(3):411–421, 2022.
  3. K. Dührkop, …, S. Böcker. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat Biotechnol, 39(4):462–471, 2021.
  4. R. Schmid, …, P. C. Dorrestein. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nat Commun, 12(1):3832, 2021.
  5. M. Ludwig, …, S. Böcker. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat Mach Intell, 2(10):629–641, 2020.
  6. A. Tripathi, …, P. C. Dorrestein. Chemically-informed analyses of metabolomics mass spectrometry data with Qemistree. Nat Chem Biol, 17(2):146–151, 2021.
  7. L.-F. Nothias, …, P. C. Dorrestein. Feature-based molecular networking in the GNPS analysis environment. Nat Methods, 17(9):905–908, 2020.
  8. K. Dührkop, …, S. Böcker. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods, 16(4):299–302, 2019.
  9. K. Scheubert, …, S. Böcker. Significance estimation for large scale metabolomics annotations by spectral matching. Nat Commun, 8:1494, 2017.
  10. K. Dührkop, H. Shen, M. Meusel, J. Rousu, and S. Böcker. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci USA, 112(41):12580–12585, 2015.

