One of the more unexpected products to launch out of the Microsoft Ignite 2023 event is a tool that can create a photorealistic avatar of a person and animate that avatar saying things that the person didn't necessarily say.
Called Azure AI Speech text-to-speech avatar, the new feature, available in public preview as of today, lets users generate videos of an avatar speaking by uploading images of a person they wish the avatar to resemble and writing a script. Microsoft's tool trains a model to drive the animation, while a separate text-to-speech model -- either prebuilt or trained on the person's voice -- "reads" the script aloud.
"With text to speech avatar, users can more efficiently create video ... to build training videos, product introductions, customer testimonials [and so on] simply with text input," writes Microsoft in a blog post. "You can use the avatar to build conversational agents, virtual assistants, chatbots and more."
Avatars can speak in multiple languages. And, for chatbot scenarios, they can tap AI models like OpenAI's GPT-3.5 to respond to off-script questions from customers.
Now, there are countless ways such a tool could be abused -- which Microsoft, to its credit, realizes. (Similar avatar-generating tech from AI startup Synthesia has been misused to produce propaganda in Venezuela and false news reports promoted by pro-China social media accounts.) Most Azure subscribers will only be able to access prebuilt -- not custom -- avatars at launch; custom avatars are currently a "limited access" capability available by registration only and "only for certain use cases," Microsoft says.
But the feature raises a host of uncomfortable ethical questions.
One of the major sticking points in the recent SAG-AFTRA strike was the use of AI to create digital likenesses. Studios ultimately agreed to pay actors for their AI-generated likenesses. But what about Microsoft and its customers?
I emailed Microsoft about its position on companies using actors' likenesses without, in the actors' views, proper compensation or even notification. The company hadn't responded as of publication time -- nor had it said whether it would require that companies label avatars as AI-generated, as YouTube and a growing number of other platforms do.
In a follow-up email, a spokesperson clarified that Microsoft requires custom avatar customers to obtain "explicit written permission" and consent statements from avatar talent and "ensure that the customer’s agreement with each individual contemplates the duration, use and any content limitations." The company also mandates that customers add disclosures stating that the avatars are AI-generated.
Microsoft appears to have more guardrails around a related generative AI tool, personal voice, that's also launching at Ignite.
Personal voice, a new capability within Microsoft's custom neural voice service, can replicate a user's voice in a few seconds given a one-minute speech sample as an audio prompt. Microsoft pitches it as a way to create personalized voice assistants, dub content into different languages and generate bespoke narrations for stories, audio books and podcasts.
To ward off potential legal headaches, Microsoft is banning the use of prerecorded speech, requiring that users give "explicit consent" in the form of a recorded statement, and verifying that this statement matches other, one-time-use training data before a customer can use personal voice to synthesize new speech. Access to the feature is gated behind a registration form for the time being, and customers must agree to use personal voice only in applications "where the voice does not read user-generated or open-ended content."
"Voice model usage must remain within an application and output must not be publishable or shareable from the application," Microsoft writes in a blog post. "[C]ustomers who meet limited access eligibility criteria maintain sole control over the creation of, access to and use of the voice models and their output [where it concerns] dubbing for films, TV, video and audio for entertainment scenarios only."
Microsoft didn't initially answer TechCrunch's questions about how actors might be compensated for their voice contributions -- or whether it plans to implement any sort of watermarking tech so that AI-generated voices might be more easily identified.
Later in the day, a spokesperson said via email that watermarks will be automatically added to personal voices, making it easier to identify whether the speech is synthesized -- and which voice it was synthesized from. But there's a catch. Building watermark detection into an app or platform requires gaining approval from Microsoft to use its watermark detection service -- which obviously isn't ideal.
This story was originally published at 8am PT on Nov. 15 and updated at 3:30pm PT.