Introducing The Swahili Thinking Dataset

The first open-source Swahili chain-of-thought reasoning dataset

Swahili Thinking Dataset visualization

Introduction

Today, we are excited to release the Swahili Thinking Dataset, the first open-source dataset for chain-of-thought reasoning in Swahili. This dataset contains 166 high-quality examples of conversational AI responses where models explicitly demonstrate their reasoning process before generating final answers.

While chain-of-thought reasoning datasets exist for major languages like English, French, and Spanish, there are no publicly accessible high-quality chain-of-thought reasoning datasets for African languages. The Swahili Thinking Dataset addresses this gap, enabling researchers and developers to build more capable Swahili language models that can think before they respond.

Dataset Overview

The dataset was created by professionally translating 200 English examples from the HuggingFaceH4/Multilingual-Thinking dataset using GPT-5 Pro, resulting in 166 successful translations. Each example demonstrates explicit reasoning in Swahili across diverse conversational scenarios.

Dataset Structure

Each example contains six fields following the Harmony response format:

Example Conversations

Example 1: Currency exchange reasoning

Example showing multi-step reasoning about currency exchange rates, demonstrating the model's ability to break down complex queries into logical steps.

Example 2: Location-based service reasoning

The model reasons through how to help users find nearby grocery stores, considering multiple approaches and regional context.

Example 3: Book recommendation reasoning

Detailed reasoning process for recommending historical fiction books, showing consideration of multiple criteria and diverse options.

Loading the Dataset

You can load the dataset directly from HuggingFace:

from datasets import load_dataset

dataset = load_dataset("Nadhari/Swahili-Thinking", split="train")

# Access first example
example = dataset[0]
print(example['user'])      # User query in Swahili
print(example['analysis'])  # Chain-of-thought reasoning
print(example['final'])     # Final response

Use Cases

This dataset enables several important applications:

Dataset Statistics

Future Plans

This release is just the beginning. We plan to:

Community Contributions

We welcome contributions from the community. If you would like to help expand this dataset or have suggestions for improvement, please:

Citation

If you use this dataset in your research, please cite:

@misc{swahili-thinking-dataset-2025,
  title={Swahili Thinking Dataset},
  author={Nadhari AI},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Nadhari/Swahili-Thinking}
}

Acknowledgments

This work builds upon the excellent Multilingual-Thinking dataset by HuggingFace H4. We are grateful for their contribution to the open-source AI community.

Seti ya Data ya Mantiki za Kiswahili

Dataset ya kwanza ya wazi ya ufikiri wa mnyororo wa mawazo katika Kiswahili

Muonekano wa Swahili Thinking Dataset

Utangulizi

Leo, tunafurahi kutambulisha Swahili Thinking Dataset, dataset ya kwanza ya wazi (open-source) kwa ufikiri wa mnyororo wa mawazo katika Kiswahili. Dataset hii ina mifano 166 ya ubora wa juu ya majibu ya AI ya mazungumzo ambapo modeli huonyesha wazi mchakato wao wa kufikiri kabla ya kutoa majibu ya mwisho.

Ingawa datasets za ufikiri wa mnyororo wa mawazo zipo kwa lugha kuu kama Kiingereza, Kifaransa na Kihispania, hakuna datasets za ubora wa juu za aina hii zinazopatikana hadharani kwa lugha za Afrika. Swahili Thinking Dataset inashughulikia pengo hili, ikiwezesha watafiti na wasanidi programu kujenga modeli za Kiswahili zenye uwezo zaidi zinazoweza kufikiri kabla ya kujibu.

Muhtasari wa Dataset

Dataset hii iliundwa kwa kutafsiri kitaalamu mifano 200 ya Kiingereza kutoka kwa dataset ya HuggingFaceH4/Multilingual-Thinking kwa kutumia GPT-5 Pro, na kuzaa jumla ya tafsiri 166 zilizofaulu. Kila mfano unaonyesha ufikiri wa wazi kwa Kiswahili katika mazingira mbalimbali ya mazungumzo.

Muundo wa Dataset

Kila mfano una sehemu sita zinazofuata muundo wa majibu wa Harmony:

Mifano ya Mazungumzo

Mfano 1: Ufikiri wa ubadilishaji wa sarafu

Mfano unaonyesha ufikiri wa hatua nyingi kuhusu viwango vya ubadilishaji wa sarafu, ukionyesha uwezo wa modeli kujibu maswali magumu katika hatua za kimantiki.

Mfano 2: Ufikiri wa huduma kulingana na eneo

Modeli inafikiri hatua kwa hatua jinsi ya kuwasaidia watumiaji kupata maduka ya vyakula yaliyo karibu nao, ikizingatia mbinu tofauti na muktadha wa kikanda.

Mfano 3: Ufikiri wa mapendekezo ya vitabu

Mchakato wa kina wa ufikiri wa kupendekeza vitabu vya hadithi za kihistoria, ukionyesha kuzingatia vigezo vingi na chaguo mbalimbali.

Kupakia Dataset

Unaweza kupakua dataset hii moja kwa moja kutoka HuggingFace:

from datasets import load_dataset

dataset = load_dataset("Nadhari/Swahili-Thinking", split="train")

# Fikia mfano wa kwanza
example = dataset[0]
print(example['user'])      # Swali la mtumiaji kwa Kiswahili
print(example['analysis'])  # Ufikiri wa mnyororo wa mawazo
print(example['final'])     # Jibu la mwisho

Matumizi

Dataset hii inafungua matumizi muhimu kadhaa:

Takwimu za Dataset

Mipango ya Baadaye

Toleo hili ni mwanzo tu. Tunapanga:

Michango ya Jamii

Tunakaribisha michango kutoka kwa jamii. Ikiwa ungependa kusaidia kupanua dataset hii au una mapendekezo ya uboreshaji, tafadhali:

Nukuu

Iwapo utatumia dataset hii katika utafiti wako, tafadhali taja nukuu ifuatayo:

@misc{swahili-thinking-dataset-2025,
  title={Swahili Thinking Dataset},
  author={Nadhari AI},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/datasets/Nadhari/Swahili-Thinking}
}

Shukrani

Kazi hii imejengwa juu ya dataset bora ya Multilingual-Thinking kutoka HuggingFace H4. Tunawashukuru kwa mchango wao mkubwa kwa jamii ya AI ya chanzo huria.