Here is the text from my TEDxSiliconAlley talk, which was inspired by a back-and-forth I had with Marvin Minsky after the passing of McCarthy. It started when I asked him: "So if the homoiconic Lisp is the ideal programming language for AI, then where exactly are the macros of the mind?"
Arguably, the clearest and most concise description of our greatest goal was inscribed on the ancient temple of Apollo in Delphi: Gnothi Seauton, Know Thyself. Now no one knows exactly who said it first, or even who was smart enough to put it there, but given how long ago that was, we now have something new to ask: are we there yet? Do we yet know ourselves? Some students of artificial intelligence and neuroscience think we are indeed close.
Thankfully, the words on Richard Feynman's last chalkboard, "What I cannot create, I do not understand," remind us that we don't truly understand what we cannot yet create.
Today, I'd like to help identify an oft-ignored prerequisite whose absence keeps us from crafting artificial minds: emotive speech synthesis. In other words, computerized voices as dynamic as our own.
Alan Turing proposed a simple test in which a human guesses whether an interrogated subject is human or machine. If the machine is erroneously chosen as often as the human, then the machine passes. This simple idea of passing a Q&A with a human just often enough has captivated many in AI.
And so from chat bot to chat bot, programs have increasingly succeeded in bringing the textual banter of the machine closer to that of a human. But we're decades in now, and none of it works, because the machines never really "get" what we want to talk about. Why?
Well, computers, as we've designed them so far, are heuristic engines, that is, they find stuff, stuff that matches patterns. And for a long time we've thought that if we can give them good enough strategies and powerful enough hardware, then they'll be able to understand our world. But it doesn't work out, because the strategies are never general enough and the patterns are never comprehensive enough. For instance, you can ask a machine for the shortest route to the nearest hospital, and it'll find it for you.
But when even the shortest path is too long, it won't do what Dashrath Manjhi did: after losing his wife to complications, he spent the next 22 years, like a machine but more notably as a human, digging a path through the mountain that had stood in their way.
You see, while heuristics may be instructions and may yield instructions, they never achieve the depth of the inspiring imperative we call understanding. Let's stop to think about that: understanding as imperative. Not surprisingly, early childhood development research shows that when we first start using our voice, we do so not to describe, but simply to affect. So when we say "ma", we try to keep ma around. Likewise, the first speech from a parent that we understand comes in the form of simple commands like "open wide".
And from this, all the other subtleties of language bloom, as seemingly offensive and defensive extensions of persuasion. Thought about another way, speech at its core is mind control. So making computers good at it should be easy, right? I mean, we execute programs as commands, so all we need to do is extend their language to include our own words. But that's not the case, fundamentally because the origin of language is not in words, it's in articulation.
In fact, our perception of sound is so slanted toward the physics of articulation that what we see can actually distort what we hear. Consider the classic example of the McGurk effect. Here, when the same sound is played to us several times over, we hear 'ba', 'ba', 'ba'. But when we listen again to the exact same sound while watching a video of a mouth articulating 'da', we instead hear 'da', 'da', 'da'.
This overriding of sound by perceived articulation hints at a common evolutionary basis for language, one that explains why non-vocalized yet still physically articulated languages, like American Sign Language, are as expressive as any other first language. This property is known as multi-modality, but it can be confusingly interpreted to include non-articulated communication like writing. That's a problem, because we start thinking of written language as equivalent to articulated language. But it's not, because when we speak, it's not just the words that come out, it's their feeling. That's not to say that writing can't imply feeling, but insofar as writing is just shorthand for the articulated, feeling must be left as an exercise for the reader.
All of a sudden, the traditional Natural Language Processing pipeline that we've used for so long in computer science becomes a hopelessly lossy conversion. In fact, had we simply required that the Turing test be administered over the phone, then traditional AI's word-centric approach would have soon come off as nothing more than a fool's errand.
But there's more: if articulation is the basis of human language and understanding is an imperative, then what we've just identified is a true API to the mind, a foundation for its externalized macros, rooted in not just what we say, but how we say it.
Now, computerized voices are nothing new. Even the first Macintosh wowed us by introducing itself and saying hi. But a push toward AI that doesn't just use such voices, but instead incorporates a wider emotive range of articulation, is new.
Now imagine a future where the work of synthetic voice directors and AI developers feeds back one into the other, unlocking a growing set of these macros.
- Where synthetic voice profiles can be given personality and timbre.
- Where games don't repeat the same lines over and over again.
- Where calls to customer service leave you scratching your head asking "was that a robot?"
And much more if I had the time to tell you. But for now, remember that we are all storytellers, eager, as the poet Homer said, to live as song in the ears of the future.
And so, on our path to craft artificial minds, one that we find ourselves both destined and determined to take, we should not forget that there's an interface right below our brains that gives us yet plenty to build at the bottom.