
Google Speech-to-Text v2 API is slow.

Since August 13, 2024, the Google Speech-to-Text v2 API seems to be taking far too long to transcribe audio. Before that, it was working fine.

Has anyone experienced this? I am open to discussing it, and any help would be greatly appreciated.

Technologies used: Node.js 20, Speech-to-Text v2 API


Hi bidur_nepali,
 
Thanks for reaching out in the Google Cloud Community. I understand you're experiencing increased latency with the Speech-to-Text v2 API since August 13, 2024. I apologize for the inconvenience this is causing.
 
To help me investigate this further, could you provide some more details?
  1. Specifics about the increased latency: can you quantify how much longer it's taking? Are we talking about seconds, minutes, or more?
  2. Audio file details: what type of audio files are you transcribing (e.g., format, length, sample rate)?
  3. Code snippet: if possible, please share a sanitized version of your Node.js code that interacts with the Speech-to-Text API. This will help me identify any potential issues in your implementation.
  4. Regional endpoint: which Google Cloud region are you using for the API?
 
In the meantime, here are a few things you can check:
  1. Quotas and limits: Ensure you haven't exceeded any quotas or limits for the Speech-to-Text API in your project.
  2. Network connectivity: Check your network connection to ensure there are no issues that might be causing delays.
  3. API status: Review the Google Cloud Status Dashboard to see if there are any reported issues with the Speech-to-Text API.
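
To quantify the latency (item 1 above), a minimal timing wrapper along these lines can help. This is an illustrative sketch: `timeIt` and the stand-in task are not part of the original code; in practice the task would be the full batchRecognize-plus-polling flow.

```javascript
// Latency-measurement sketch: wraps any async call and reports elapsed
// wall-clock time in seconds. `task` stands in for the real API flow.
async function timeIt(label, task) {
    const start = Date.now();
    const result = await task();
    const elapsedSec = (Date.now() - start) / 1000;
    console.log(`${label} took ${elapsedSec.toFixed(1)}s`);
    return { result, elapsedSec };
}

// Example with a stand-in task that resolves after ~50 ms.
timeIt('transcription', () => new Promise(r => setTimeout(() => r('done'), 50)));
```

Logging the elapsed time per request makes it easy to show a before/after comparison when filing a support case.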
 
I'll be happy to assist you further once I have more information.
 
Looking forward to your response!
 
Best regards,
 
Jared

Thank you for reaching out. Here is the detailed situation:

Audio Length: 46 seconds

Audio Type: .mp3

Language: Japanese

Cloud Run Function Region: asia-northeast1

Recognizer Region: global

Node.js code:

const { v2 } = require('@google-cloud/speech');
const speechClient = new v2.SpeechClient();
const { PubSub } = require('@google-cloud/pubsub');
const pubSubClient = new PubSub();

// The enclosing function was missing from the original paste; the
// wrapper name and parameters here are assumed for completeness.
async function startTranscription(file, gs_uri, languageCode, googleLanguageCode, pubsubTopicId) {
    const recognizerId = await getRecognizer(languageCode, googleLanguageCode);
    const config = {
        // Note: the v2 field name is autoDecodingConfig
        // (AutoDetectDecodingConfig is the message type, not the field).
        autoDecodingConfig: {},
        languageCodes: [googleLanguageCode],
        model: 'long',
        features: {
            enableWordTimeOffsets: true,
            enableAutomaticPunctuation: true,
        },
    };

    const fileMetadata = {
        uri: gs_uri,
    };

    const request = {
        recognizer: recognizerId,
        config,
        files: [fileMetadata],
        recognitionOutputConfig: {
            inlineResponseConfig: {},
        },
    };

    try {
        const [operation] = await speechClient.batchRecognize(request);
        const data = JSON.stringify({ operationName: operation.name, fileName: file.name, uri: gs_uri });
        const dataBuffer = Buffer.from(data);
        await pubSubClient.topic(pubsubTopicId).publishMessage({ data: dataBuffer });
        console.log(`Message published to the topic id: ${pubsubTopicId}`);
    } catch (error) {
        console.error('Error starting transcription:', error);
    }
}

async function getRecognizer(languageCode, googleLanguageCode) {
    const recognizerId = `${languageCode}recognizer`;
    const parent = speechClient.locationPath(projectId, 'global');
    const [recognizers] = await speechClient.listRecognizers({ parent });
    const recognizer = recognizers.find(r => r.name.endsWith(recognizerId));

    if (recognizer) {
        return recognizer.name;
    } else {
        const [operation] = await speechClient.createRecognizer({
            recognizer: {
                languageCodes: [googleLanguageCode],
                model: 'long',
                defaultRecognitionConfig: {
                    features: {
                        enableWordTimeOffsets: true,
                        enableAutomaticPunctuation: true,
                    },
                    autoDecodingConfig: {}
                }
            },
            recognizerId: recognizerId,
            parent: parent,
        });
        const [response] = await operation.promise();
        return response.name;
    }
}
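
One incidental cost in getRecognizer above is that it calls listRecognizers on every invocation. A small in-memory cache avoids that round trip on warm instances. This is a sketch under the assumption that a recognizer name is stable per language; `lookupRecognizer` stands in for the list-or-create logic above and is not part of the original code.

```javascript
// Sketch: memoize recognizer names per language so the list/create
// round trip only happens once per process. `lookupRecognizer` stands
// in for the list-or-create logic in getRecognizer.
const recognizerCache = new Map();

async function getRecognizerCached(languageCode, lookupRecognizer) {
    if (recognizerCache.has(languageCode)) {
        return recognizerCache.get(languageCode);
    }
    const name = await lookupRecognizer(languageCode);
    recognizerCache.set(languageCode, name);
    return name;
}
```

On Cloud Run, the cache lives only as long as the instance, which is fine here since a stale entry just triggers one extra lookup.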

The code above submits the audio for transcription. A separate function then checks the transcription progress:

const [operation] = await speechClient.operationsClient.getOperation({ name: operationName });
if (operation.done) {
    done = true;
    if (operation.response) {
        console.log(`Operation ${operationName} is done. Processing results.`);
        const BatchRecognizeResponse = protoRoot.lookupType('google.cloud.speech.v2.BatchRecognizeResponse');
        const response = BatchRecognizeResponse.decode(operation.response.value);
        console.log('here is the response:', response.results[gcsUri]);
        const inlineResult = response.results[gcsUri].inlineResult;
        const transcriptResults = inlineResult.transcript.results;
        const srt = gstt.convertGSTTToSRT(transcriptResults);
        await streamFileUpload(fileName, srt).catch(console.error);
    } else if (operation.error) {
        console.error('Transcription operation failed:', operation.error);
        await streamFileUpload(fileName, "").catch(console.error);
    }
} else {
    const OperationMetadata = protoRoot.lookupType('google.cloud.speech.v2.OperationMetadata');
    const metadata = OperationMetadata.decode(operation.metadata.value);
    console.log('here is the metadata:', metadata);
    const progressPercent = metadata.progressPercent || 0;
    console.log(`Transcription progress: ${progressPercent}%`);
    await new Promise(resolve => setTimeout(resolve, interval));
    interval = Math.min(interval * 2, 4000);
}
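
The snippet above runs inside an enclosing loop (the `done` and `interval` variables come from it) and doubles the wait up to 4 seconds. As a self-contained sketch of that exponential-backoff polling pattern, with `checkOperation` standing in for the getOperation call (an assumption, not the original helper):

```javascript
// Exponential-backoff polling sketch: `checkOperation` stands in for
// speechClient.operationsClient.getOperation. The wait doubles each
// round, capped at maxIntervalMs, mirroring the loop above.
async function pollUntilDone(checkOperation, initialIntervalMs = 500, maxIntervalMs = 4000) {
    let interval = initialIntervalMs;
    for (;;) {
        const operation = await checkOperation();
        if (operation.done) {
            return operation;
        }
        await new Promise(resolve => setTimeout(resolve, interval));
        interval = Math.min(interval * 2, maxIntervalMs);
    }
}
```

A deadline check inside the loop (give up after, say, 10 minutes) would also make the over-an-hour case fail fast instead of polling indefinitely.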

Before August 13, the same audio took around 11 seconds to complete, but now it takes more than an hour.

The quota looks fine, and network connectivity is about the same as before. I have tested with more than one audio file; all are under 10 seconds except the 46-second one. All the reports look fine. The transcription does complete, but it takes more than an hour.
I hope the above information helps. I am hoping for a quick response. Thank you.