Audio Machine Learning · Audio Datasets · Multimodal AI

Hi, I’m Gijs Wijngaard.

I explore how machines listen and understand audio. I focus on building novel audio understanding models and datasets.

I am currently a PhD candidate at Maastricht University, where I am supervised by Elia Formisano and Michel Dumontier.

I am also a Co-Founder of Encode Europe, where we aim to build a future in which AI benefits society through advocacy, education, and policy research.

Portrait of Gijs Wijngaard

Updates

Latest news and publications.

  • Sep 2025

    AudSemThinker at NeurIPS

Our reasoning-enhanced audio-language model was accepted for a poster session at NeurIPS 2025 in San Diego!

  • Jan 2025

    Survey Accepted to IEEE Access

Our comprehensive review of audio-language datasets was accepted to the IEEE Access journal!

  • Jan 2023

    ACES at EUSIPCO

Audio Captioning Evaluation on Semantics of Sound, our evaluation metric for semantic audio captioning, was accepted to EUSIPCO 2023 in Helsinki.

Selected Papers

Peer-reviewed highlights across audio understanding and datasets.

NeurIPS 2025 · 2025

AudSemThinker: Enhancing Audio-Language Models through Reasoning over Semantics of Sound

Introduces AudSemThinker, a reasoning-enriched audio-language model that outperforms state-of-the-art methods by structuring its reasoning around auditory semantics, supported by AudSem, a novel dataset.

Preprint · 2025

Data-Balanced Curriculum Learning for Audio Question Answering

Combines curriculum learning with statistical data balancing to improve audio QA accuracy by 11.7% on DCASE 2025, addressing dataset imbalances through difficulty-based training and category filtering.

IEEE Access · 2025

Audio-Language Datasets of Scenes and Events: A Survey

A survey of 69 audio-language datasets, analyzing their characteristics, biases, and challenges for training next-generation models.

EUSIPCO 2023

ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

Introduces ACES, a novel metric for automated audio captioning that evaluates captions based on how humans derive semantic information from sounds, moving beyond traditional text-based metrics.