Labadain Chat: A Conversational Agent for the Tetun Language
Author(s): Gabriel de Jesus and Sérgio Nunes
Published in: 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2026)
Melbourne, Australia, 20–24 July, 2026
This study presents Labadain Chat, a conversational agent for Tetun that adapts existing LLMs through language-specific prompting strategies. It details the agent's architecture, functionalities, applications, and value for the Tetun-speaking community.
Preprint Available Soon
Zero-Shot and Hybrid Strategies for Tetun Ad-Hoc Text Retrieval
Author(s): Gabriel de Jesus, Siddharth AK Singh, Sérgio Nunes, and Andrew Yates
Published in: 11th ACM SIGIR / 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025)
Padua, Italy, July 18, 2025
This work explores zero-shot dense and hybrid retrieval methods for Tetun, highlighting the effectiveness of combining pretrained models with LLM-enhanced document representations.
Insights into LLM-Based Conversational Search: A Study of Tetun-Speaking Users' Search Behavior
Author(s): Gabriel de Jesus and Sérgio Nunes
Published in: 11th ACM SIGIR / 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR 2025)
Padua, Italy, July 18, 2025
This paper analyzes real-world prompt logs from an LLM-based conversational assistant for Tetun speakers and introduces LabadainLog-17k+, a new dataset for conversational search in Tetun.
Exploring Large Language Models for Relevance Judgments in Tetun
Author(s): Gabriel de Jesus and Sérgio Nunes
Published in: First Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2024), co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024)
Washington D.C., USA, July 18, 2024
This paper examines the use of large language models to automate relevance judgments in Tetun, showing agreement levels comparable to human assessors and to results reported for high-resource languages.
Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus
Author(s): Gabriel de Jesus and Sérgio Nunes
Published in: Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Turin, Italy, 20–25 May, 2024
This paper introduces Labadain Crawler, a web-based pipeline for building text corpora in low-resource languages, and demonstrates its effectiveness by constructing a high-quality Tetun text corpus.
Labadain-30k+: A monolingual Tetun document-level audited dataset
Author(s): Gabriel de Jesus and Sérgio Nunes
Published in: 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages at LREC-COLING 2024
Turin, Italy, 20–25 May, 2024
This paper presents Labadain-30k+, a Tetun text dataset comprising 33.6k documents manually audited at the document level, alongside a content analysis that highlights the evolution of web documents and emerging trends in written Tetun.