Publications

Selected conference papers and research publications.

The full list of my publications is available on my Google Scholar profile.

Labadain Chat: A Conversational Agent for the Tetun Language

Author(s): Gabriel de Jesus and Sérgio Nunes

Published in: 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026)

Melbourne, Australia, 20–24 July, 2026

This study presents Labadain Chat, a conversational agent for Tetun that adapts existing LLMs using language-specific prompting strategies, and details its architecture, features, applications, and value for the Tetun-speaking community.

Download Paper

Zero-Shot and Hybrid Strategies for Tetun Ad-Hoc Text Retrieval

Author(s): Gabriel de Jesus, Siddharth AK Singh, Sérgio Nunes, and Andrew Yates

Published in: 11th ACM SIGIR / 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025)

Padua, Italy, July 18, 2025

This work explores zero-shot dense and hybrid retrieval methods for Tetun, highlighting the effectiveness of combining pretrained models with LLM-enhanced document representations.

Download Paper

Insights into LLM-Based Conversational Search: A Study of Tetun-Speaking Users' Search Behavior

Author(s): Gabriel de Jesus and Sérgio Nunes

Published in: 11th ACM SIGIR / 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025)

Padua, Italy, July 18, 2025

This paper analyzes real-world prompt logs from an LLM-based conversational assistant for Tetun speakers and introduces LabadainLog-17k+, a new dataset for conversational search in Tetun.

Download Paper

Exploring Large Language Models for Relevance Judgments in Tetun

Author(s): Gabriel de Jesus and Sérgio Nunes

Published in: First Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2024), co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)

Washington, D.C., USA, July 18, 2024

This paper examines the use of large language models to automate relevance judgments in Tetun, showing agreement levels comparable to human assessors and to results reported for high-resource languages.

Download Paper

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus

Author(s): Gabriel de Jesus and Sérgio Nunes

Published in: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Turin, Italy, 20–25 May, 2024

This paper introduces Labadain Crawler, a web-based pipeline for building text corpora in low-resource languages, and demonstrates its effectiveness by constructing a high-quality Tetun text corpus.

Download Paper

Labadain-30k+: A Monolingual Tetun Document-Level Audited Dataset

Author(s): Gabriel de Jesus and Sérgio Nunes

Published in: 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages (SIGUL 2024) at LREC-COLING 2024

Turin, Italy, 20–25 May, 2024

This paper presents Labadain-30k+, a Tetun text dataset comprising 33.6k documents manually audited at the document level, alongside a content analysis that highlights the evolution of web documents and emerging trends in written Tetun.

Download Paper