Audio & Transcription

Text-to-Speech (TTS) & Speech-to-Text (STT) Examples with Pawa AI

Text-to-Speech (TTS)

Text-to-Speech (TTS) allows you to generate high-quality, natural-sounding speech directly from text.
This capability is essential for building voice-enabled applications, making content accessible to wider audiences, and creating immersive user experiences.

By default, the TTS API endpoint streams its response, so your app can use Server-Sent Events (SSE) to receive the audio as it is generated. If you do not use SSE, the API falls back to non-streaming mode: it waits until the full audio has been generated and then returns it in a single response.

You can learn about streaming here.
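
The streaming note above mentions Server-Sent Events. As a rough illustration, the sketch below parses the generic SSE wire format, where events are separated by blank lines and the payload is carried in data: fields. How Pawa AI encodes audio inside each event is not described in this section, so the sample lines and the helper name iter_sse_data are assumptions made for illustration only.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each Server-Sent Event found in a stream of text lines."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
            continue
        if line.startswith(":"):
            # Lines starting with a colon are comments (often keep-alives).
            continue
        field, _, value = line.partition(":")
        if field == "data":
            # The format allows one optional space after the colon.
            buffer.append(value[1:] if value.startswith(" ") else value)


# Made-up sample lines; real events from the API may look different.
sample = ["data: chunk-1\n", "\n", ": keep-alive\n", "data: chunk-2\n", "data: chunk-3\n", "\n"]
events = list(iter_sse_data(sample))
print(events)
```

In a real client the lines would come from the HTTP response body read incrementally, and each payload would be decoded according to whatever encoding the API actually uses.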

Supported Languages For Text to Speech

Swahili
English

Use Cases

Voice assistants: Let your chatbot respond with speech instead of text only.
Learning platforms: Automatically generate audio versions of documents, lessons, or Q&A sessions.
Accessibility tools: Help users with visual impairments interact with your app through audio.
Media & podcasts: Generate narrations from written articles or blogs.

Voice options

The TTS endpoint provides 3 built‑in local voices that control how speech is rendered from text:

Ame
Liora
Ayana

Models with Text to speech Audio Capabilities

Pawa Text To Speech (pawa-tts-v1-20250704) with text-to-speech conversion.

Example Playback Speech

Original Text: “Jina la jamhuri ya muungano wa Tanzania, ni nchi iliyopo Afrika ya Mashariki ndani ya ukanda wa maziwa makuu ya Afrika, imepakana na Uganda na Kenya upande wa kaskazini, Bahari ya Hindi upande wa mashariki, Msumbiji malawi na Zambia upande wa kusini, Congo, Burundi na Rwanda upande wa magharibi, eneo la Tanzania ni takribani kilometa za mraba 940 mb/h. Saa arobaini na dakika elfu 300, eneo linalokaliwa na maji ne asalimia 6.2 - Mlima Kilimanjaro - Mlima mrefu zaidi barani Afrika upo kaskazini mashariki wa Tanzania.”

Text to Speech Request Example

curl --request POST \
     --url https://api.pawa-ai.com/v1/voice/text-to-speech \
     --header "Authorization: Bearer $PAWA_AI_API_KEY" \
     --header 'Content-Type: application/json' \
     --data '{
       "model": "pawa-tts-v1-20250704",
       "voice": "ame",
       "max_tokens": 65536,
       "temperature": 0.5,
       "top_p": 0.95,
       "text": "Hello, welcome to Pawa AI. Upgrade now to enjoy Unlimited access to advanced AI",
       "repetition_penalty": 1.1
     }' \
     --output speech.mp3

The audio is saved as speech.mp3 in the current directory; open the file to play the generated speech.
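
For applications that call the endpoint from code rather than from the shell, the same request can be sketched in Python using only the standard library. The helper names build_tts_request and save_speech are hypothetical, and the lowercase voice ids "liora" and "ayana" are an assumption extrapolated from the "ame" value in the curl example. The sketch also assumes the response body is the audio itself, as the --output flag in the curl example suggests.

```python
import json
import urllib.request

TTS_URL = "https://api.pawa-ai.com/v1/voice/text-to-speech"
# Assumption: all three documented voices use lowercase ids like "ame".
VOICES = {"ame", "liora", "ayana"}


def build_tts_request(api_key: str, text: str, voice: str = "ame") -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    payload = {
        "model": "pawa-tts-v1-20250704",
        "voice": voice,
        "text": text,
        "max_tokens": 65536,
        "temperature": 0.5,
        "top_p": 0.95,
        "repetition_penalty": 1.1,
    }
    return urllib.request.Request(
        TTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def save_speech(request: urllib.request.Request, path: str = "speech.mp3") -> None:
    """Send the request and write the response body to disk (performs a network call)."""
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        out.write(response.read())


request = build_tts_request("YOUR_API_KEY", "Hello, welcome to Pawa AI.")
print(request.get_method(), request.full_url)
```

Calling save_speech(request) with a valid API key would then write the audio to speech.mp3. Building the request separately from sending it keeps the payload easy to inspect and test.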

Speech-to-Text (STT)

Speech-to-Text (STT) converts audio into text with high accuracy, with optional speaker identification and timestamps.
This is powerful for transcription, audio search, summarization, and voice-enabled interfaces.

Supported Languages For Speech to Text

Swahili
English
Luo
Meru
Kamba
Hausa
Igbo
Yoruba
Zulu
Tswana
Nyakole etc…

Use Cases

Meeting and call center transcription: Turn long discussions into structured notes.
Customer service: Convert call center conversations into searchable text.
Education: Transcribe lectures, podcasts, and webinars.
Productivity: Voice notes and dictation apps.

Models with Speech to Text Audio Capabilities

Pawa Speech To Text (pawa-stt-v1-20240701) with audio-input-to-text conversion.

Original Playback Speech

Example Text: “Jina la jamhuri ya muungano wa Tanzania, ni nchi iliyopo Afrika ya Mashariki ndani ya ukanda wa maziwa makuu ya Afrika, imepakana na Uganda na Kenya upande wa kaskazini, Bahari ya Hindi upande wa mashariki, Msumbiji malawi na Zambia upande wa kusini, Congo, Burundi na Rwanda upande wa magharibi, eneo la Tanzania ni takribani kilometa za mraba 940 mb/h. Saa arobaini na dakika elfu 300, eneo linalokaliwa na maji ne asalimia 6.2 - Mlima Kilimanjaro - Mlima mrefu zaidi barani Afrika upo kaskazini mashariki wa Tanzania.”

Speech to Text Request `without Speaker Diarization` Example

curl  --request POST \
      --url https://api.pawa-ai.com/v1/voice/speech-to-text \
      --header "Authorization: Bearer $PAWA_AI_API_KEY" \
      --header 'Content-Type: multipart/form-data' \
      --form files=@sartify_info2.wav \
      --form model=pawa-stt-v1-20240701 \
      --form language=English \
      --form is_speaker_diarization=false 
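
The multipart upload that curl performs with --form can also be assembled by hand. The following is a minimal sketch using only the Python standard library; encode_multipart and build_stt_request are hypothetical helper names, and the audio/wav part type is an assumption for a .wav upload. The request is built but not sent.

```python
import uuid
import urllib.request

STT_URL = "https://api.pawa-ai.com/v1/voice/speech-to-text"


def encode_multipart(fields: dict, filename: str, audio: bytes, mime: str = "audio/wav"):
    """Return (body, content_type) for a multipart/form-data upload with one file."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # Plain text form fields.
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode("utf-8")
        )
    # The audio goes in the "files" field, as in the curl example.
    header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'
        f'Content-Type: {mime}\r\n\r\n'
    ).encode("utf-8")
    parts.append(header + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_stt_request(api_key: str, filename: str, audio: bytes, diarization: bool = False):
    """Build (but do not send) a POST request mirroring the curl example."""
    fields = {
        "model": "pawa-stt-v1-20240701",
        "language": "English",
        "is_speaker_diarization": "true" if diarization else "false",
    }
    body, content_type = encode_multipart(fields, filename, audio)
    return urllib.request.Request(
        STT_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )


# Placeholder bytes stand in for real audio read from disk.
request = build_stt_request("YOUR_API_KEY", "sartify_info2.wav", b"placeholder-audio-bytes")
print(request.get_method(), len(request.data), "bytes")
```

Passing the request to urllib.request.urlopen with a valid key and real audio bytes would submit it; passing diarization=True sets the is_speaker_diarization form field to true.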

Example Response Without Speaker Diarization

{
  "success": true,
  "message": "Audio transcribed succesfully",
  "data": {
    "transcriptions": [
      {
        "filename": "hello.wav",
        "transcript": "okay let's break down the products certify has built so far here's a list based on available information hasto astory poll this is the flagship product it's an ai power document processing platform essentially it uses ai to understand and process documents likely automating tasks like data extraction classification and more astor astory to warried an ai powered educational tool designed to improve learning outcomes it likely provides personalized learning experiences tutoring assistance and other educational supports as says the power models these aren't a product per se but rather the underlying ai models that power their other products there are a suite of smaller specialized ai models focusing on african languages vision and audio think of them as the engine behind docipro and tutorai it's worth noting that satilfy is a relatively young company so their product line is still"
      }
    ]
  }
}
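
As a small illustration of consuming this response shape, the sketch below maps each uploaded filename to its transcript. The helper name transcripts_by_file is hypothetical, the sample is a shortened copy of the response above, and raising an error when success is false is an assumption about how a client might handle failures.

```python
import json


def transcripts_by_file(response: dict) -> dict:
    """Map each filename in the response to its plain-text transcript."""
    if not response.get("success"):
        # Assumption: treat success == false as an error and surface the message.
        raise RuntimeError(response.get("message", "transcription failed"))
    return {
        item["filename"]: item["transcript"]
        for item in response["data"]["transcriptions"]
    }


# Shortened copy of the documented response.
sample = json.loads("""
{
  "success": true,
  "message": "Audio transcribed succesfully",
  "data": {
    "transcriptions": [
      {"filename": "hello.wav", "transcript": "okay let's break down the products"}
    ]
  }
}
""")
result = transcripts_by_file(sample)
print(result["hello.wav"])
```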

Set is_speaker_diarization to true to enable speaker diarization (speaker identification) in the response.

Example Response With Speaker Diarization

{
  "success": true,
  "message": "Audio transcribed succesfully",
  "data": {
    "transcriptions": [
      {
        "filename": "hello.wav",
        "transcript": [
          {
            "Start": 0,
            "End": 14.87,
            "Speaker": 0,
            "Content": "Okay, let's break down the products Certify has built so far. Here's a list based on available information. Astor Astoripo, this is their flagship product. It's an AI-powered document processing platform."
          },
          {
            "Start": 14.9,
            "End": 24.61,
            "Speaker": 0,
            "Content": "Essentially, it uses AI to understand and process documents, likely automating tasks like data extraction, classification, and more."
          },
          {
            "Start": 24.87,
            "End": 38.81,
            "Speaker": 0,
            "Content": "As well as Certifiturai, an AI-powered educational tool designed to improve learning outcomes. It likely provides personalized learning experiences, tutoring assistance, and other educational sup-"
          },
          {
            "Start": 39.05,
            "End": 47.55,
            "Speaker": 0,
            "Content": "As such, as we power models, we aren't a product per se, but rather the underlying AI models that power the other products."
          },
          {
            "Start": 47.67,
            "End": 65.76,
            "Speaker": 0,
            "Content": "There are a suite of smaller specialized AI models focusing on African languages, vision, and audio. Think of them as the engine behind Docipro and TutorAI. It's worth noting that Certify is a relatively young company, so their product line is still evolving."
          }
        ]
      }
    ]
  }
}
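
With diarization enabled, transcript becomes a list of segments with Start, End, Speaker, and Content keys, as in the response above. The sketch below turns such segments into readable timestamped lines; format_segments is a hypothetical helper, the sample contents are shortened, and treating Start and End as seconds is an assumption based on the values shown.

```python
def format_segments(segments: list) -> list:
    """Render diarized segments as '[start-end] Speaker N: text' lines."""
    lines = []
    for seg in segments:
        lines.append(
            f"[{seg['Start']:.2f}-{seg['End']:.2f}] Speaker {seg['Speaker']}: {seg['Content']}"
        )
    return lines


# Two segments taken from the documented response, with shortened content.
segments = [
    {"Start": 0, "End": 14.87, "Speaker": 0, "Content": "Okay, let's break down the products."},
    {"Start": 14.9, "End": 24.61, "Speaker": 0, "Content": "Essentially, it uses AI to understand and process documents."},
]
lines = format_segments(segments)
for line in lines:
    print(line)
```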

You can upload up to 10 files per request, each smaller than 500MB. Supported types are audio/mp3, audio/mpeg, audio/x-mp3, audio/wav, audio/wave, audio/x-wav, audio/x-pn-wav, audio/aac, audio/m4a, audio/x-m4a, audio/x-mp4, audio/ogg, audio/opus, audio/x-ms-wma, and audio/wma. By default, the server processes files in batches of 5.
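
A client can check these limits before uploading to avoid a rejected request. The sketch below is one way to do that; validate_upload is a hypothetical helper, and interpreting 500MB as 500 * 1024 * 1024 bytes is an assumption, since the exact unit is not stated.

```python
MAX_FILES = 10
# Assumption: "500MB" is read as 500 MiB; the exact unit is not documented here.
MAX_BYTES = 500 * 1024 * 1024
ALLOWED_TYPES = {
    "audio/mp3", "audio/mpeg", "audio/x-mp3", "audio/wav", "audio/wave",
    "audio/x-wav", "audio/x-pn-wav", "audio/aac", "audio/m4a", "audio/x-m4a",
    "audio/x-mp4", "audio/ogg", "audio/opus", "audio/x-ms-wma", "audio/wma",
}


def validate_upload(files: list) -> list:
    """Check (filename, mime_type, size_bytes) tuples; return a list of problems found."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"too many files: {len(files)} (limit is {MAX_FILES})")
    for name, mime, size in files:
        if mime not in ALLOWED_TYPES:
            problems.append(f"{name}: unsupported type {mime}")
        if size >= MAX_BYTES:
            problems.append(f"{name}: file must be smaller than 500MB")
    return problems


print(validate_upload([("talk.wav", "audio/wav", 2_000_000)]))
print(validate_upload([("clip.mp4", "video/mp4", 2_000_000)]))
```

An empty list means the batch passes all three checks; any entries describe what to fix before sending.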
