Publications

You can also find my articles on my Google Scholar profile.

Journal Articles


Conference Papers


Cross-Lingual Information Retrieval in Tetun for Ad-Hoc Search

Author(s): Altedio Araújo, Gabriel de Jesus, Sérgio Nunes.

Published in the 24th EPIA Conference on Artificial Intelligence, Faro, Portugal, 1–3 October, 2025

This paper introduces the first CLIR baseline for Tetun, evaluating translation-based retrieval across multiple languages and highlighting the challenges for Tetun.

Download Paper

Zero-Shot and Hybrid Strategies for Tetun Ad-Hoc Text Retrieval

Author(s): Gabriel de Jesus, Siddharth AK Singh, Sérgio Nunes, Andrew Yates.

Published in the 11th ACM SIGIR / the 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), Padua, Italy, 18 July, 2025

This work explores zero-shot dense and hybrid retrieval methods for Tetun, highlighting the effectiveness of combining pretrained models with LLM-enhanced document representations.

Download Paper

Insights into LLM-Based Conversational Search: A Study of Tetun-Speaking Users’ Search Behavior

Author(s): Gabriel de Jesus, Sérgio Nunes.

Published in the 11th ACM SIGIR / the 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), Padua, Italy, 18 July, 2025

This paper analyzes real-world prompt logs from an LLM-based conversational assistant for Tetun speakers, revealing user search behaviors and releasing LabadainLog-17k+, a new dataset for conversational search in Tetun.

Download Paper

Exploring Large Language Models for Relevance Judgments in Tetun

Author(s): Gabriel de Jesus, Sérgio Nunes.

Published in the First Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2024), co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), Washington D.C., USA, July 18, 2024

This paper examines the use of large language models to automate relevance judgments in Tetun, showing agreement levels comparable to human assessors and to results reported for high-resource languages.

Download Paper

Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus

Author(s): Gabriel de Jesus, Sérgio Nunes.

Published in the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 20–25 May, 2024

This paper introduces Labadain Crawler, a web-based pipeline for building text corpora in low-resource languages, and demonstrates its effectiveness by constructing a high-quality Tetun corpus from over 22,000 web pages.

Download Paper

Labadain-30k+: A monolingual Tetun document-level audited dataset

Author(s): Gabriel de Jesus, Sérgio Nunes.

Published in the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages at LREC-COLING 2024, Torino, Italia, 20–25 May, 2024

This paper presents Labadain-30k+, a Tetun text dataset of 33.6k documents manually audited at the document level, alongside a content analysis highlighting the evolution of web documents and trends in written Tetun.

Download Paper

Network-based Approach for Stopwords Detection

Author(s): Felermino Ali, Gabriel de Jesus, Henrique Cardoso, Rui Sousa-Silva, Sérgio Nunes.

Published in the 16th International Conference on Computational Processing of the Portuguese Language (PROPOR 2024), Santiago de Compostela, Galicia, Spain, 12–15 March, 2024

This paper introduces a network-based approach for automatic stopword detection in low-resource languages, tested on Tetun and Emakhuwa. By leveraging co-occurrence network properties, the method outperforms traditional frequency-based techniques, offering a scalable solution for NLP in under-resourced linguistic contexts.

Download Paper

Text Information Retrieval in Tetun

Author(s): Gabriel de Jesus.

Published in the 45th European Conference on Information Retrieval (ECIR 2023), Dublin, Ireland, April 2–6, 2023

This work addresses the lack of information retrieval solutions for Tetun by investigating ad-hoc text retrieval methods and developing datasets and resources to support effective search in this low-resource language.

Download Paper

Text Information Retrieval in Tetun: A Preliminary Study

Author(s): Gabriel de Jesus.

Published in the 10th edition of the PhD Symposium on FDIA, Lisbon, Portugal, July 20, 2023

This paper presents preliminary work on text information retrieval for Tetun, investigating ad-hoc search methods to support effective search solutions in this low-resource language.

Download Paper