How to replace Google NotebookLM voices

There are quite a few videos and articles out there about how you can replace the Google NotebookLM voices using, for example, ElevenLabs. While ElevenLabs will undoubtedly sound great, it does come at a cost.

In this article we will look at how you can replace the Google NotebookLM voices on the cheap, as in really, really cheap, as in cents. For the most part it will be free, as long as you stay within Azure's TTS (Text to Speech) free limits. There are also no subscription fees; it is pay-as-you-go.

You can read more about Azure TTS pricing here: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/speech-services/

You can easily put the steps below into one big Python/Node/etc. file, or into the low-code/no-code tool or DAG of your choice, and pretty much automate the entire process (except for step 1, perhaps).
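As a rough sketch, the end-to-end flow could be wired together like this. The helper function names are just placeholders for the steps described in this article, not a real library:

#Rough outline of the end-to-end flow. The helper names below are placeholders
#for the steps described in this article, not an existing library.
def run_pipeline(podcast_mp3):
    transcript_json = diarize_with_deepgram(podcast_mp3)   #Step 3
    ssml = convert_transcript_to_ssml(transcript_json)     #Step 4
    job_id = submit_azure_batch_synthesis(ssml)            #Step 6
    sas_url = poll_until_done(job_id)                      #Step 6
    download_audio(sas_url, "new_podcast.mp3")             #Step 6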


Step 1: Create your Google NotebookLM podcast

Head over to https://notebooklm.google.com/ and create your podcast. As a super quick recap, you essentially just add your sources and then click on the “Load” button in the “Audio Overview” section.

Download the podcast audio to your local computer.

Google NotebookLM Audio Overview




Step 2: Create a free Deepgram account

The Google NotebookLM podcast is in mono, meaning there is no stereo separation between the two speakers. Oftentimes, especially in call centers, you will have the call center agent on, say, the left channel and the customer on the right channel. For podcasts, though, that kind of separation would sound weird, hence the mono mix.

Diarization aims to isolate the speakers so that there is a clear separation between who said what. This is where Deepgram comes in.

Head over to https://deepgram.com/ and create a free account. Free as in no credit card needed. They give you $200 of credit to play with, for free! This will last you quite a long time.

Once registered, create a new API key. Just use the default permissions.

New Deepgram API key


Step 3: Diarize and Transcribe the Podcast using Deepgram

Here we will use Deepgram's Speech to Text (STT) API, which also diarizes on the fly, so we only have to make a single API call.

I urge you to check out their API documentation as they can do sentiment analysis and a whole bunch of other things: https://developers.deepgram.com/docs/

Here is the cURL to diarize and transcribe the Google NotebookLM podcast. You can import this into Postman and point it to your downloaded podcast file.

curl --location 'https://api.deepgram.com/v1/listen?smart_format=true&punctuate=true&diarize=true&language=en-US&model=nova-2&filler_words=true' \
--header 'Authorization: Token <API KEY HERE>' \
--header 'Content-Type: text/plain' \
--data-binary '@C:/temp/googlelm.mp3'
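If you would rather not use Postman, the same call can be made from Python with the requests library. Here is a minimal sketch; swap in your own API key and file path:

import requests

DEEPGRAM_API_KEY = "<API KEY HERE>"

url = "https://api.deepgram.com/v1/listen"
params = {
    "smart_format": "true",
    "punctuate": "true",
    "diarize": "true",
    "language": "en-US",
    "model": "nova-2",
    "filler_words": "true"
}
headers = {
    "Authorization": f"Token {DEEPGRAM_API_KEY}",
    "Content-Type": "audio/mpeg"
}

#Send the raw audio bytes of the downloaded NotebookLM podcast
with open("c:\\temp\\googlelm.mp3", "rb") as audio:
    response = requests.post(url, params=params, headers=headers, data=audio)
response.raise_for_status()

#Save the response so the code in step 4 can read it from disk
with open("c:\\temp\\deepgram.json", "w") as f:
    f.write(response.text)

This also saves the response straight to c:\temp\deepgram.json, which is the file the code in step 4 reads.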

After about 15 seconds you will get your response. It is quite a big JSON file, but essentially we are only interested in this part of it: json["results"]["channels"][0]["alternatives"][0]["paragraphs"]["paragraphs"]

Here is a quick breakdown of the section we are interested in (an abbreviated example of the raw JSON shape follows below):

  • The orange blocks are our speakers (speaker 0 and speaker 1) and the green blocks are the text they spoke, i.e. the "utterance".
  • The "start" key is the time (in seconds) the speaker started talking, and "end" is the time the speaker stopped talking.
  • You will see in the screenshot that speaker 0 essentially had 2 utterances right after each other. The reason it is not one big long sentence is because Deepgram, and most other STT solutions, will detect a pause (of either a set length or a length the user can provide) as the end of an utterance.
Deepgram transcribed snippet
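In case the screenshot is hard to read, each entry in that paragraphs array is shaped roughly like this (the text and timings below are made up for illustration):

{
    "sentences": [
        {"text": "Welcome back to the deep dive.", "start": 0.08, "end": 1.92},
        {"text": "Today we are digging into something a little different.", "start": 2.4, "end": 5.04}
    ],
    "speaker": 0,
    "start": 0.08,
    "end": 5.04
}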


Now save this JSON response to a file, e.g. c:\temp\deepgram.json


Step 4: Convert the JSON to SSML

Next, we need to get the spoken sentences into SSML. SSML stands for "Speech Synthesis Markup Language" and is essentially just an XML document formatted in a specific way to help TTS (Text to Speech) systems understand what they should do with the provided text. For example, you can add pauses after sentences, assign a specific voice to a specific sentence, etc.

We will be using Azure's HD voices to assign one voice to speaker 0 and a different voice to speaker 1.

I will not be going into too much detail when it comes to the code below. The only thing you need to do is point it to the Deepgram JSON file you saved in step 3.

High level, this is what the code does:

  • Creates a simpler JSON object that is much easier to work with.
  • Combines utterances spoken right after each other by the same speaker into 1 big sentence, as this helps the model understand context and produces much better tone and emotion, which would otherwise get lost in very short utterances.
  • Creates the SSML from the JSON, allocating each speaker their own voice.
  • Saves the SSML (an XML file) to disk.

import json
import xml.etree.ElementTree as ET
import xml.dom.minidom

# Step 1
#Create our variables
speakers = {}
speakers["all"] = []

#Define speaker-to-voice mapping
speaker_voices = {
    0: "en-US-Andrew:DragonHDLatestNeural",
    1: "en-US-Ava:DragonHDLatestNeural"
}

#Step 2
#Read the deepgram transcript with diarization
with open('c:\\temp\\deepgram.json') as f:
    data = json.load(f)

#Step 3
#Get all paragraphs
all_sentences = (data["results"]["channels"][0]["alternatives"][0]["paragraphs"]["paragraphs"])

#Step 4
#Iterate through all the paragraphs and get the start and end time of the main sentence and add it to our main JSON
#We don't really have to compute short_talk_duration, but it could come in handy for analysis later on
for sentence in all_sentences:
    main_sentence_start = sentence["start"]
    main_sentence_end = sentence["end"]
    main_talk_duration = main_sentence_end - main_sentence_start
    speaker = sentence["speaker"] #will be 0 or 1

    for s in sentence["sentences"]:
        text = s["text"]
        short_sentence_start = s["start"]
        short_sentence_end = s["end"]
        short_talk_duration = short_sentence_end - short_sentence_start
        speakers["all"].append({"speaker": speaker, 
                                "text": text, 
                                "start": short_sentence_start, 
                                "end": short_sentence_end, 
                                "short_talk_duration": short_talk_duration
                            })


#Step 5
#Let's combine multiple utterances spoken by the same speaker right after each other into 1 big sentence.
#This could help models understand better what is going on and improve tone and emotion, especially
#in very short sentences
speakers_compact = {}
speakers_compact["all"] = []

combined_text = ""
current_speaker = ""
old_speaker = None
for utterance in speakers["all"]:
    current_speaker = utterance["speaker"]

    if current_speaker != old_speaker:
        #Don't append the first time
        if old_speaker is not None:
            speakers_compact["all"].append({"speaker": old_speaker, "text": combined_text.strip()})

        old_speaker = current_speaker
        combined_text = ""

    combined_text = combined_text + " " + utterance["text"]

#Don't forget the last speaker's combined text once the loop is done
if combined_text.strip():
    speakers_compact["all"].append({"speaker": current_speaker, "text": combined_text.strip()})
    

#Step 6
#In this section we will create the SSML
def generate_tts_xml(json_data, speaker_voices, pretty):
    #Root <speak> element
    #The default xmlns (the SSML synthesis namespace) is required by Azure
    speak = ET.Element(
        "speak",
        attrib={
            "xmlns": "http://www.w3.org/2001/10/synthesis",
            "xmlns:mstts": "http://www.w3.org/2001/mstts",
            "xmlns:emo": "http://www.w3.org/2009/10/emotionml",
            "xml:lang": "en-US",
        },
        version="1.0"
    )
    
    current_speaker = None
    voice_element = None
    
    #Process utterances
    for item in json_data["all"]:
        speaker = item["speaker"]
        
        #Check if speaker changes
        if speaker != current_speaker:
            current_speaker = speaker
            voice_name = speaker_voices.get(speaker, "en-US-JennyNeural")
            voice_element = ET.SubElement(speak, "voice", name=voice_name)
        
        #Add text content
        text_element = ET.SubElement(voice_element, "s")
        text_element.text = item["text"]

    #Convert to a string and pretty-print
    raw_xml = ET.tostring(speak, encoding="unicode", method="xml")

    if pretty is True:
        dom = xml.dom.minidom.parseString(raw_xml)
        return dom.toprettyxml(indent="  ")
    else:
        return raw_xml.replace('"','\\"')

azure_tts_xml = generate_tts_xml(speakers_compact, speaker_voices, False)

#Step 7
#Write the XML to disk
with open(str("c:\\temp\\ssmsRAW.xml"), "w") as f:
    f.write(azure_tts_xml)

When calling generate_tts_xml, you can pass True as the last argument to get a properly formatted (pretty-printed) XML file. Pass False to get a "raw" single-line file, with its quotes escaped, that can be passed to Azure TTS.
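For example, to eyeball the pretty version and keep the raw version for Azure:

#Human-readable version, handy for inspecting the result
pretty_xml = generate_tts_xml(speakers_compact, speaker_voices, True)
print(pretty_xml)

#Escaped single-line version for pasting into the Azure TTS request body
azure_tts_xml = generate_tts_xml(speakers_compact, speaker_voices, False)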

Here is a snippet of what the SSML looks like:

SSML snippet
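If the image does not load for you, the pretty-printed output of generate_tts_xml looks roughly like this (utterances shortened and made up for illustration):

<?xml version="1.0" ?>
<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" xml:lang="en-US" version="1.0">
  <voice name="en-US-Andrew:DragonHDLatestNeural">
    <s>Welcome back to the deep dive.</s>
    <s>Today we are digging into something a little different.</s>
  </voice>
  <voice name="en-US-Ava:DragonHDLatestNeural">
    <s>Yeah, there is a lot to unpack here.</s>
  </voice>
</speak>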


Step 5: Create an Azure Speech Service

Create an Azure Speech Service. At the time of writing, TTS and HD Voices are not supported by the new Azure AI Foundry. So create the below “Speech Service” resource in the Azure Portal (https://portal.azure.com).

You can choose the free pricing tier. I am not sure if HD voices will work in the free tier, but give it a shot. If they don't, use the S0 tier.

Azure AI Services Speech Services


Once the resource is created, get your API key and endpoint. Keep your keys safe!

Azure Speech Services API Key and Endpoint


Step 6: Send the SSML to Azure Speech Services

We will make another API call using Postman.

  • The purple text is a unique identifier. Make sure it is unique for each of your TTS calls. This unique key is also how you get the URL to download the audio, so perhaps save it somewhere.
  • The red text is the XML output from step 4. Remember to pass False as the third argument to generate_tts_xml.
  • You can play around with the “outputFormat” but the provided option is sufficient.

You can read more about this API here:

curl --location --request PUT 'https://eastus.api.cognitive.microsoft.com/texttospeech/batchsyntheses/MustbeSuperUnique111?api-version=2024-04-01' \
--header 'Ocp-Apim-Subscription-Key: <KEY HERE>' \
--header 'Content-Type: application/json' \
--data '{
    "description": "My Postman call",
    "inputKind": "SSML",
    "inputs": [
        {
            "content": "<XML OUTPUT HERE>"
        }
    ],
    "properties": {
        "outputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "wordBoundaryEnabled": false,
        "sentenceBoundaryEnabled": false,
        "concatenateResult": true, 
        "decompressOutputFiles": false
    }
}'
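And the same call from Python, if you want to automate it end to end. This is a sketch; the endpoint, key and synthesis ID are placeholders you must replace with your own:

import json
import requests

SPEECH_KEY = "<KEY HERE>"
ENDPOINT = "https://eastus.api.cognitive.microsoft.com"  #use your own region/endpoint
SYNTHESIS_ID = "MustbeSuperUnique111"                    #must be unique per job

#Read the raw SSML produced in step 4. That file has its quotes escaped for
#pasting into Postman, so undo the escaping here and let json.dumps handle it instead.
with open("c:\\temp\\ssmsRAW.xml") as f:
    ssml = f.read().replace('\\"', '"')

body = {
    "description": "My Python call",
    "inputKind": "SSML",
    "inputs": [
        {"content": ssml}
    ],
    "properties": {
        "outputFormat": "audio-16khz-128kbitrate-mono-mp3",
        "wordBoundaryEnabled": False,
        "sentenceBoundaryEnabled": False,
        "concatenateResult": True,
        "decompressOutputFiles": False
    }
}

url = f"{ENDPOINT}/texttospeech/batchsyntheses/{SYNTHESIS_ID}?api-version=2024-04-01"
headers = {
    "Ocp-Apim-Subscription-Key": SPEECH_KEY,
    "Content-Type": "application/json"
}

response = requests.put(url, headers=headers, data=json.dumps(body))
response.raise_for_status()
print(response.json()["status"])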

The response will look like this:

  • The most important parts right now are "status" and "timeToLiveHours".
  • "status" tells us whether Azure has started the TTS job, is still busy with it, or is done with it.
  • "timeToLiveHours" tells us how long (in hours) the audio will remain available for download after Azure completes the job. You can modify this value in your initial call to make it shorter or longer.
Azure Speech Services Initial Response


You can poll your job to see if it is completed by calling this API:

curl --location 'https://eastus.api.cognitive.microsoft.com/texttospeech/batchsyntheses/MustbeSuperUnique111?api-version=2024-04-01' \
--header 'Ocp-Apim-Subscription-Key: <KEY HERE>'

Note the unique value in purple, it must match the unique value used in your initial call.
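A simple polling loop in Python could look like this (same placeholder key, endpoint and synthesis ID as before):

import time
import requests

SPEECH_KEY = "<KEY HERE>"
ENDPOINT = "https://eastus.api.cognitive.microsoft.com"
SYNTHESIS_ID = "MustbeSuperUnique111"

status_url = f"{ENDPOINT}/texttospeech/batchsyntheses/{SYNTHESIS_ID}?api-version=2024-04-01"
headers = {"Ocp-Apim-Subscription-Key": SPEECH_KEY}

while True:
    job = requests.get(status_url, headers=headers).json()
    print("Job status:", job["status"])
    if job["status"] in ("Succeeded", "Failed"):
        break
    time.sleep(10)  #wait a bit before asking again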

Once the job is done, the output will look like this:

  • We can see that status is “Succeeded” and we have an “outputs” key with a very long URL.
  • This long URL (I blurred most of it out) is a SAS token. This is essentially your secret URL to download the audio file.
  • You will also see that the value in purple matches our unique value.

You can just copy and paste that long SAS URL into a browser and the audio file will be downloaded. Now you have successfully replaced the Google NotebookLM voices with different voices.

Azure Speech Services Job Done Response
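If you would rather fetch the audio programmatically instead of pasting the SAS URL into a browser, something like this works, assuming the completed job response from the polling loop above is in the job variable. The "result" key name under "outputs" is an assumption; grab whatever URL you see under "outputs" in your own response:

import requests

#Grab the SAS download URL from the completed job response and save the file.
#Depending on your job settings this may be a single audio file or a zip archive.
download_url = job["outputs"]["result"]  #assumed key name, adjust to match your response

audio = requests.get(download_url)
audio.raise_for_status()

with open("c:\\temp\\azure_tts_output.mp3", "wb") as f:
    f.write(audio.content)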

As a side note, the "billingDetails" key can contain any of the following. Since I am still within my free character limit, no billing is taking place.

You can read more about billingDetails here:

Azure Speech Services billingDetails
