I wanted my AI assistant to have a consistent voice. Not just any voice—a personality. Something that would sound the same whether it was announcing a task completion, reading a summary, or celebrating a successful build.
The problem: I was using ElevenLabs, and while the voices were good, every time I wanted a different emotional tone I had to manually specify it. There was no connection between who my AI is and how it sounds.
So I built something different.
What if an AI's personality traits determined how it expressed emotions vocally?
Think about it: when someone confident gets bad news, they don't deflate the same way someone anxious does. Their voice stays steadier, more solution-focused. Same emotion, different expression—because of personality.
I wanted that for my AI assistant, Kai.
The system has three parts:
Message → Emotion Detection → Personality Filter → Voice Expression → TTS

I found Qwen3-TTS from Alibaba, an open-source TTS model that accepts natural language descriptions for voice design. Instead of picking from preset voices, you describe what you want:
voice_prompt = """Slightly masculine androgynous young voice,
Japanese-accented, rapid speech pattern, futuristic AI friend
who thinks fast and talks fast, warm but efficient"""

The model (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign) generates speech matching that description. No voice IDs, no preset libraries: just describe the voice you want.
Instead of using word lists or sentiment analysis, I used LLM inference to detect emotions in real time:
import asyncio
import json
from typing import Optional

async def infer_emotion_with_llm(text: str, context: Optional[str] = None):
    system_prompt = """Classify the emotional tone of this text.
    Return JSON with: emotion (celebratory/confident/neutral/
    concerned/frustrated/reflective/curious/focused/apologetic)"""
    # Shell out to the LLM inference CLI and capture its JSON output
    process = await asyncio.create_subprocess_exec(
        "bun", "run", "Inference.ts", "--level", "fast", "--json",
        system_prompt, text,
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await process.communicate()
    # Returns: {"emotion": "confident", "intensity": "moderate"}
    return json.loads(stdout)

The detector runs on every voice request, reading both the message and optional conversation context to pick the right emotion.
Here's where it gets interesting. I defined 12 personality traits on a 0-100 scale:
{
  "enthusiasm": 60,
  "energy": 75,
  "expressiveness": 65,
  "resilience": 85,
  "composure": 70,
  "optimism": 75,
  "warmth": 70,
  "formality": 30,
  "directness": 80,
  "precision": 95,
  "curiosity": 90,
  "playfulness": 45
}

These traits shape how emotions manifest. High resilience (85) means the voice stays steady even when things go wrong. High precision (95) keeps the articulation clear even in emotional states. Low formality (30) keeps the tone casual.
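In code, those traits can live in one small container. Below is a minimal sketch, assuming a hypothetical Personality dataclass; the field names mirror the JSON above plus the name and base_voice fields that appear in the API payload later.

from dataclasses import dataclass

@dataclass
class Personality:
    # Hypothetical container for the 0-100 trait scores
    name: str = "kai"
    base_voice: str = ""
    enthusiasm: int = 60
    energy: int = 75
    expressiveness: int = 65
    resilience: int = 85
    composure: int = 70
    optimism: int = 75
    warmth: int = 70
    formality: int = 30
    directness: int = 80
    precision: int = 95
    curiosity: int = 90
    playfulness: int = 45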
The personality filter takes the detected emotion and produces voice instructions:
def get_personality_emotion_expression(emotion: str, personality: Personality):
    # Example: emotion="concerned", resilience=85, composure=70
    if emotion == "concerned":
        if personality.resilience > 70:
            return {
                "vocal_qualities": "steady, focused",
                "pacing": "measured but not slow",
                "intensity": "controlled",
                "notes": "stays solution-oriented",
            }
        else:
            return {
                "vocal_qualities": "slightly worried",
                "pacing": "slower, careful",
                "intensity": "subdued",
            }

The same emotion produces different voice instructions depending on the personality traits.
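To see that branching in action, here is a quick usage sketch built on the hypothetical Personality dataclass above; the outputs are the dictionaries from the function itself.

# Kai's high resilience (85) takes the steady, solution-oriented branch
kai = Personality(resilience=85)
print(get_personality_emotion_expression("concerned", kai)["vocal_qualities"])
# steady, focused

# A lower-resilience personality gets the subdued branch for the same emotion
timid = Personality(resilience=40)
print(get_personality_emotion_expression("concerned", timid)["vocal_qualities"])
# slightly worried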
Everything combines into a natural language voice instruction:
Slightly masculine androgynous young voice, Japanese-accented,
rapid speech pattern, futuristic AI friend who thinks fast and
talks fast, warm but efficient, expressing confident, voice
assured and grounded, speaking steady, quick, at firm intensity,
articulate and exact even in this state

This gets sent to Qwen3-TTS, which generates the audio.
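The assembly step is plain string composition. Here is a minimal sketch, assuming a hypothetical build_voice_instruction helper; extra trait-driven phrases, like the precision clause at the end of the example above, would be appended the same way.

def build_voice_instruction(base_voice: str, emotion: str, expression: dict) -> str:
    # Fold the detected emotion and the personality-filtered expression
    # back into one natural language prompt for the TTS model
    parts = [
        base_voice,
        f"expressing {emotion}",
        f"voice {expression['vocal_qualities']}",
        f"speaking {expression['pacing']}",
        f"at {expression['intensity']} intensity",
    ]
    return ", ".join(parts)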
The voice server exposes a personality endpoint:
curl -X POST http://localhost:8888/notify/personality \
-H "Content-Type: application/json" \
-d '{
"message": "Build complete. All tests passing.",
"personality": {
"name": "kai",
"base_voice": "Slightly masculine androgynous...",
"enthusiasm": 60,
"energy": 75,
...
}
}'Response:
{
  "status": "success",
  "emotion_detected": "celebratory",
  "expression": {
    "vocal_qualities": "bright but not manic",
    "pacing": "quick delivery",
    "intensity": "moderate"
  }
}

I updated all my hooks to use the new system. Every voice notification (task completions, status updates, system announcements) now goes through the personality endpoint with Kai's traits.
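A hook only has to POST the message plus the shared trait payload. Here is a minimal sketch, assuming the requests library and a hypothetical identity module that exports Kai's traits as a dict; the endpoint is the one shown above.

import requests

from identity import KAI_PERSONALITY  # hypothetical shared module, one source of truth

def notify(message: str) -> dict:
    # Route every voice notification through the personality endpoint
    response = requests.post(
        "http://localhost:8888/notify/personality",
        json={"message": message, "personality": KAI_PERSONALITY},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # includes emotion_detected and the expression dict

notify("Build complete. All tests passing.")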
The result: consistent voice across all interactions. Same personality, appropriate emotional expression for context.
Personality is a filter, not a voice. The base voice stays constant. Personality shapes how emotions flow through it.
Natural language beats presets. Describing a voice in words is more intuitive than tweaking stability/similarity sliders.
Emotion detection needs context. A message like "Done." could be celebratory or exhausted—context matters.
Consistency requires infrastructure. Every voice call in my system now imports the same identity module and builds the same personality payload. One source of truth.
The code is part of my PAI (Personal AI Infrastructure) system. The voice server runs locally, the model runs on-device, and my AI assistant now has a voice that matches its personality.
That feels right.