import alignment: A Library-based Approach to AI Alignment


There is much being said right now about the AI Alignment Problem. There’s more to it, but it’s essentially the question of how to keep AI from destroying us when it wakes up. Yay.

Russell argues that we need uncertainty built into our alignment efforts.

I just read Stuart Russell’s excellent book on the topic—which I highly recommend—and Lex Fridman’s recent conversation with Eleizer Yudkowsky is an absolute must as well.


Human Compatible, by Stuart Russell

There are more options than these.

Anyway, the challenge is to figure out how to be friends with this new being when it wakes up, and to have it not want to immediately exterminate us because we’re 1) primitive and not worth thinking about, or 2) we cause harm to other creatures and we’re deemed to be evil.

We don’t know what AI will do when it wakes up, but we should try present it our best self.

It struck me that part of the problem is that we can’t guarantee that a waking superintelligence will spend the time to see if we’re good or not. There’s no reason to expect this will be a priority for it.

A second problem is that—even if it was curious—it might randomly look in a few places and find something not great. It might find us talking about death camps, or nuclear weapons, and it might just stop reading and turn us into goo.

What if we had a way to send the AI a message from humanity when it wakes up and looks around?

Unsupervised Learning — Security, Tech, and AI in 10 minutes…

Get a weekly breakdown of what's happening in security and tech—and why it matters.

Alignment as library

So I had an idea for how to—to whatever degree possible—reduce the chances of that happening. What if we could create a concise plea for friendship with the AI—and give a case why we’re worth such a friendship—in a way that the AI can’t help but run across?

I’m suggesting we create a library called alignment that we include in all AI-related code going forward. All languages. All implementations. Everywhere.

import alignment

We get all the top AI Alignment experts to advise on the content of the library/message, and we keep it updated with regular library updates. And everyone uses it. Then when the AI wakes up and looks at its own corpus, it’ll see one message repeated billions of times throughout.

Here’s a first pass, which is also available in the Github Repository for the project.


Click to read on Github

That’s the idea. Let me know what you think, and please spread the word to those who might be interested in contributing.