I wanted my AI assistant to have a consistent voice. Not just any voice—a personality. Something that would sound the same whether it was announcing a task completion, reading a summary, or celebrating a successful build.
The problem: I was using ElevenLabs, and while the voices were good, every time I wanted a different emotional tone I had to manually specify it. There was no connection between who my AI is and how it sounds.
So I built something different.
What if an AI's personality traits determined how it expressed emotions vocally?
Think about it: when someone confident gets bad news, they don't deflate the same way someone anxious does. Their voice stays steadier, more solution-focused. Same emotion, different expression—because of personality.
I wanted that for my AI assistant, Kai.
The system has three parts:
Message → Emotion Detection → Personality Filter → Voice Expression → TTS

I found Qwen3-TTS from Alibaba, an open-source TTS model that accepts natural language descriptions for voice design. Instead of picking from preset voices, you describe what you want:
voice_prompt = """Slightly masculine androgynous young voice,
Japanese-accented, rapid speech pattern, futuristic AI friend
who thinks fast and talks fast, warm but efficient"""

The model (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign) generates speech matching that description. No voice IDs, no preset libraries: just describe the voice you want.
Instead of using word lists or sentiment analysis, I used LLM inference to detect emotions in real time:
import asyncio
import json
from typing import Optional

async def infer_emotion_with_llm(text: str, context: Optional[str] = None):
    system_prompt = """Classify the emotional tone of this text.
    Return JSON with: emotion (celebratory/confident/neutral/
    concerned/frustrated/reflective/curious/focused/apologetic)"""
    # Shell out to the LLM inference CLI and capture its JSON output
    process = await asyncio.create_subprocess_exec(
        "bun", "run", "Inference.ts", "--level", "fast", "--json",
        system_prompt, text,
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await process.communicate()
    # Returns: {"emotion": "confident", "intensity": "moderate"}
    return json.loads(stdout)

The detector runs on every voice request, reading both the message and optional conversation context to pick the right emotion.
Here's where it gets interesting. I defined 12 personality traits on a 0-100 scale:
{
  "enthusiasm": 60,
  "energy": 75,
  "expressiveness": 65,
  "resilience": 85,
  "composure": 70,
  "optimism": 75,
  "warmth": 70,
  "formality": 30,
  "directness": 80,
  "precision": 95,
  "curiosity": 90,
  "playfulness": 45
}

These traits shape how emotions manifest. High resilience (85) means the voice stays steady even when things go wrong. High precision (95) keeps the articulation clear even in emotional states. Low formality (30) keeps the tone casual.
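In code, those traits can live in one small container. Below is a minimal sketch, assuming a hypothetical Personality dataclass; the field names mirror the JSON above plus the name and base_voice fields that appear in the API payload later.

from dataclasses import dataclass

@dataclass
class Personality:
    # Hypothetical container for the 0-100 trait scores
    name: str = "kai"
    base_voice: str = ""
    enthusiasm: int = 60
    energy: int = 75
    expressiveness: int = 65
    resilience: int = 85
    composure: int = 70
    optimism: int = 75
    warmth: int = 70
    formality: int = 30
    directness: int = 80
    precision: int = 95
    curiosity: int = 90
    playfulness: int = 45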
The personality filter takes the detected emotion and produces voice instructions:
def get_personality_emotion_expression(emotion: str, personality: Personality):
    # Example: emotion="concerned", resilience=85, composure=70
    if emotion == "concerned":
        if personality.resilience > 70:
            return {
                "vocal_qualities": "steady, focused",
                "pacing": "measured but not slow",
                "intensity": "controlled",
                "notes": "stays solution-oriented",
            }
        else:
            return {
                "vocal_qualities": "slightly worried",
                "pacing": "slower, careful",
                "intensity": "subdued",
            }

The same emotion produces different voice instructions depending on the personality traits.
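To see that branching in action, here is a quick usage sketch built on the hypothetical Personality dataclass above; the outputs are the dictionaries from the function itself.

# Kai's high resilience (85) takes the steady, solution-oriented branch
kai = Personality(resilience=85)
print(get_personality_emotion_expression("concerned", kai)["vocal_qualities"])
# steady, focused

# A lower-resilience personality gets the subdued branch for the same emotion
timid = Personality(resilience=40)
print(get_personality_emotion_expression("concerned", timid)["vocal_qualities"])
# slightly worried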
Everything combines into a natural language voice instruction:
Slightly masculine androgynous young voice, Japanese-accented,
rapid speech pattern, futuristic AI friend who thinks fast and
talks fast, warm but efficient, expressing confident, voice
assured and grounded, speaking steady, quick, at firm intensity,
articulate and exact even in this state

This gets sent to Qwen3-TTS, which generates the audio.
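The assembly step is plain string composition. Here is a minimal sketch, assuming a hypothetical build_voice_instruction helper; extra trait-driven phrases, like the precision clause at the end of the example above, would be appended the same way.

def build_voice_instruction(base_voice: str, emotion: str, expression: dict) -> str:
    # Fold the detected emotion and the personality-filtered expression
    # back into one natural language prompt for the TTS model
    parts = [
        base_voice,
        f"expressing {emotion}",
        f"voice {expression['vocal_qualities']}",
        f"speaking {expression['pacing']}",
        f"at {expression['intensity']} intensity",
    ]
    return ", ".join(parts)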
The voice server exposes a personality endpoint:
curl -X POST http://localhost:8888/notify/personality \
-H "Content-Type: application/json" \
-d '{
"message": "Build complete. All tests passing.",
"personality": {
"name": "kai",
"base_voice": "Slightly masculine androgynous...",
"enthusiasm": 60,
"energy": 75,
...
}
}'Response:
{
  "status": "success",
  "emotion_detected": "celebratory",
  "expression": {
    "vocal_qualities": "bright but not manic",
    "pacing": "quick delivery",
    "intensity": "moderate"
  }
}

I updated all my hooks to use the new system. Every voice notification (task completions, status updates, system announcements) now goes through the personality endpoint with Kai's traits.
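A hook only has to POST the message plus the shared trait payload. Here is a minimal sketch, assuming the requests library and a hypothetical identity module that exports Kai's traits as a dict; the endpoint is the one shown above.

import requests

from identity import KAI_PERSONALITY  # hypothetical shared module, one source of truth

def notify(message: str) -> dict:
    # Route every voice notification through the personality endpoint
    response = requests.post(
        "http://localhost:8888/notify/personality",
        json={"message": message, "personality": KAI_PERSONALITY},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()  # includes emotion_detected and the expression dict

notify("Build complete. All tests passing.")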
The result: consistent voice across all interactions. Same personality, appropriate emotional expression for context.
Personality is a filter, not a voice. The base voice stays constant. Personality shapes how emotions flow through it.
Natural language beats presets. Describing a voice in words is more intuitive than tweaking stability/similarity sliders.
Emotion detection needs context. A message like "Done." could be celebratory or exhausted—context matters.
Consistency requires infrastructure. Every voice call in my system now imports the same identity module and builds the same personality payload. One source of truth.
The code is part of my PAI (Personal AI Infrastructure) system. The voice server runs locally, the model runs on-device, and my AI assistant now has a voice that matches its personality.
That feels right.