Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
Future Blog Post
This post will show up by default. To disable scheduling of future posts, edit _config.yml and set future: false.
Blog Post number 4
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 3
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 2
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Blog Post number 1
This is a sample blog post. Lorem ipsum I can’t remember the rest of lorem ipsum and don’t have an internet connection right now. Testing testing testing this blog post. Blog posts are cool.
Portfolio
Labadain Chat (Old Portal)
An AI-powered assistant for Tetun that enables natural language interaction, supports local language use, and leverages retrieval-augmented generation (RAG) to extend the benefits of modern AI technologies. It is accessible at https://old.labadain.com. 
Labadain Chat
An agentic AI assistant for Tetun that enables natural language interaction. It is designed to support local language use by extending the original Labadain platform (old portal) with modern technology and enhanced features. The assistant is accessible at www.labadain.com. 
Labadain Search
A web-based traditional search engine for Tetun. It is accessible at www.labadain.tl. 
Timor News (Old Site)
The first version of the Timor News online outlet, dedicated exclusively to publishing news in Tetun. It is accessible at old.timornews.tl.
Timor News
An online news outlet dedicated exclusively to publishing news in Tetun. It is accessible at www.timornews.tl. 
Publications
Text Information Retrieval in Tetun: A Preliminary Study
Author(s): Gabriel de Jesus.
Published in the 10th edition of the PhD Symposium on FDIA, Lisbon, Portugal, July 20, 2023
This paper presents preliminary work on text information retrieval for Tetun, investigating ad-hoc search methods to support effective search solutions in this low-resource language.
Text Information Retrieval in Tetun
Author(s): Gabriel de Jesus.
Published in the 45th European Conference on Information Retrieval (ECIR 2023), Dublin, Ireland, April 2–6, 2023
This work addresses the lack of information retrieval solutions for Tetun by investigating ad-hoc text retrieval methods and developing datasets and resources to support effective search in this low-resource language.
Network-based Approach for Stopwords Detection
Author(s): Felermino Ali, Gabriel de Jesus, Henrique Cardoso, Rui Sousa-Silva, Sérgio Nunes.
Published in the 16th International Conference on Computational Processing of the Portuguese Language (PROPOR 2024), Santiago de Compostela, Galicia, Spain, 12–15 March, 2024
This paper introduces a network-based approach for automatic stopword detection in low-resource languages, tested on Tetun and Emakhuwa. By leveraging co-occurrence network properties, the method outperforms traditional frequency-based techniques, offering a scalable solution for NLP in under-resourced linguistic contexts.
Labadain-30k+: A monolingual Tetun document-level audited dataset
Author(s): Gabriel de Jesus, Sérgio Nunes.
Published in the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages at LREC-COLING 2024, Torino, Italia, 20–25 May, 2024
This paper presents Labadain-30k+, a Tetun text dataset of 33.6k documents manually audited at the document level, alongside a content analysis highlighting the evolution of web documents and trends in written Tetun.
Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus
Author(s): Gabriel de Jesus, Sérgio Nunes.
Published in the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italia, 20–25 May, 2024
This paper introduces Labadain Crawler, a web-based pipeline for building text corpora in low-resource languages, and demonstrates its effectiveness by constructing a high-quality Tetun corpus from over 22,000 web pages.
Exploring Large Language Models for Relevance Judgments in Tetun
Author(s): Gabriel de Jesus, Sérgio Nunes.
Published in the First Workshop on Large Language Models for Evaluation in Information Retrieval (LLM4Eval 2024), co-located with the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024), Washington D.C., USA, July 18, 2024
This paper examines the use of large language models to automate relevance judgments in Tetun, showing agreement levels comparable to human assessors and to results reported for high-resource languages.
Insights into LLM-Based Conversational Search: A Study of Tetun-Speaking Users’ Search Behavior
Author(s): Gabriel de Jesus, Sérgio Nunes.
Published in the 11th ACM SIGIR / the 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), Padua, Italy, 18 July, 2025
This paper analyzes real-world prompt logs from an LLM-based conversational assistant for Tetun speakers, revealing user search behaviors and releasing LabadainLog-17k+, a new dataset for conversational search in Tetun.
Zero-Shot and Hybrid Strategies for Tetun Ad-Hoc Text Retrieval
Author(s): Gabriel de Jesus, Siddharth AK Singh, Sérgio Nunes, Andrew Yates.
Published in the 11th ACM SIGIR / the 15th International Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), Padua, Italy, 18 July, 2025
This work explores zero-shot dense and hybrid retrieval methods for Tetun, highlighting the effectiveness of combining pretrained models with LLM-enhanced document representations.
Cross-Lingual Information Retrieval in Tetun for Ad-Hoc Search
Author(s): Altedio Araújo, Gabriel de Jesus, Sérgio Nunes.
Published in the 24th EPIA Conference on Artificial Intelligence, Faro, Portugal, 1–3 October, 2025
This paper introduces the first CLIR baseline for Tetun, evaluating translation-based retrieval across multiple languages and highlighting the challenges for Tetun.
Establishing a Foundation for Tetun Ad-Hoc Text Retrieval: Stemming, Indexing, Retrieval, and Ranking
Author(s): Gabriel de Jesus, Sérgio Nunes.
Published on arXiv, 2025
This paper establishes foundational components for Tetun ad-hoc text retrieval, examining stemming, indexing, retrieval, and ranking strategies to enable effective search in this low-resource language.
Talks
Conference Proceedings Talk on Labadain-30k+ Dataset Construction
This conference proceedings talk presents the construction of Labadain-30k+, a manually audited Tetun text dataset, outlining the data collection pipeline, quality control process, and key insights from content analysis to support NLP and information retrieval research in a low-resource language.
Conference Proceedings Talk on the Labadain Crawler Pipeline for LRLs
This conference proceedings talk presents the Labadain Crawler, a web-based data collection pipeline designed for low-resource languages, detailing its architecture, language processing components, and its application to building a high-quality Tetun text corpus.
Labadain: The Foundation of Tetun Language Technology
This talk introduces Labadain as the foundation of Tetun language technology, showcasing how datasets, tools, and AI systems enable inclusive digital access for Tetun speakers.
AI for Tetun: Building Timor-Leste’s Inclusive Digital Future
This talk explores how artificial intelligence can support Tetun, an official but low-resource language of Timor-Leste, by enabling inclusive digital access through language technologies, datasets, and information retrieval systems.
Teaching
Operating Systems
Undergraduate teaching (invited lecturer), Faculty of Engineering, National University of Timor-Leste, 2009
Taught an undergraduate course on Operating Systems to third-year Informatics Engineering students at the Faculty of Engineering, National University of Timor-Leste (UNTL) in Dili, from February to June 2009.
Information, Communication, and Technology
Undergraduate teaching (invited lecturer), Faculty of Law, National University of Timor-Leste, 2013
Taught an undergraduate course, Introduction to Information, Communication, and Technology, to second-year law students at the Faculty of Law, National University of Timor-Leste, in Dili, from September to December 2013.
Data Mining and Data Warehouse
Undergraduate teaching (invited lecturer), Instituto Profissional de Canossa (IPDC), Computer Engineering, 2015
Taught undergraduate courses in Data Warehouse and Data Mining to second-year Computer Engineering students at the Instituto Profissional de Canossa (IPDC) in Dili, Timor-Leste, from February to June 2015.
Game Programming and OOP
Undergraduate teaching (invited lecturer), Instituto Profissional de Canossa (IPDC), Computer Engineering, 2016
Taught undergraduate courses in Game Programming and Object-Oriented Programming (OOP) to third-year Computer Engineering students at the Instituto Profissional de Canossa (IPDC) in Dili, Timor-Leste, from February to June 2016.
Cross-Lingual Information Retrieval in Tetun
Supervisor, Escola Superior de Tecnologia e Gestão (ESTG), Felgueiras, Portugal, 2025
Supervised an undergraduate student in Information Systems at the Escola Superior de Tecnologia e Gestão (ESTG), from September 2024 to July 2025. The project involved developing cross-lingual information retrieval systems for Tetun, addressing challenges in multilingual search for low-resource languages.
Semantic Search in Sports Search Engine
Co-Supervisor, Faculty of Engineering, University of Porto (FEUP), Porto, Portugal, 2025
Co-supervised a master’s student in Informatics Engineering at FEUP, from September 2024 to July 2025. The project focused on integrating semantic search capabilities into a sports search engine, enhancing the retrieval of relevant sports-related information through advanced natural language processing techniques.
