Since August 13, 2024, the Google Speech-to-Text v2 API seems to be taking much longer to transcribe. Before that it was working fine.
Has anyone experienced this? I am open to discussion, and any help would be much appreciated.
Technologies used: Node.js 20, Speech-to-Text v2 API
Thank you for reaching out. Here is the detailed situation:
Audio Length: 46 seconds
Audio Type: .mp3
Language: Japanese
Cloud Run Function Region: asia-northeast1
Recognizer Region: global
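One mismatch I noticed while writing this up: the Cloud Run function runs in asia-northeast1 but the recognizer lives in global. I do not know whether that explains the slowdown, but a co-located recognizer is one thing I can test. A minimal sketch of how the resource names would differ (the project and recognizer IDs below are placeholders):

```javascript
// Hypothetical helper (not in my code) to build recognizer resource names,
// just to show the location difference I want to test.
function recognizerName(projectId, location, recognizerId) {
  return `projects/${projectId}/locations/${location}/recognizers/${recognizerId}`;
}

// What I use today (recognizer in the global location):
console.log(recognizerName('my-project', 'global', 'jarecognizer'));
// → projects/my-project/locations/global/recognizers/jarecognizer

// A recognizer co-located with the Cloud Run function would be:
console.log(recognizerName('my-project', 'asia-northeast1', 'jarecognizer'));
// → projects/my-project/locations/asia-northeast1/recognizers/jarecognizer
```

If I try this, I believe the client also has to point at the matching regional endpoint (something like asia-northeast1-speech.googleapis.com via the client's apiEndpoint option), but I would verify that against the Speech-to-Text v2 docs.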
Node.js code:

```javascript
const { v2 } = require('@google-cloud/speech');
const speechClient = new v2.SpeechClient();
const { PubSub } = require('@google-cloud/pubsub');
const pubSubClient = new PubSub();

// ...inside the async function that handles the uploaded file:
  const recognizerId = await getRecognizer(languageCode, googleLanguageCode);
  const config = {
    // Auto-detect the audio encoding (v2 field name: autoDecodingConfig,
    // as in getRecognizer below)
    autoDecodingConfig: {},
    languageCodes: [googleLanguageCode],
    model: 'long',
    features: {
      enableWordTimeOffsets: true,
      enableAutomaticPunctuation: true,
    },
  };
  const fileMetadata = {
    uri: gs_uri,
  };
  const request = {
    recognizer: recognizerId,
    config,
    files: [fileMetadata],
    recognitionOutputConfig: {
      inlineResponseConfig: {},
    },
  };
  try {
    const [operation] = await speechClient.batchRecognize(request);
    const data = JSON.stringify({ operationName: operation.name, fileName: file.name, uri: gs_uri });
    const dataBuffer = Buffer.from(data);
    await pubSubClient.topic(pubsubTopicId).publishMessage({ data: dataBuffer });
    console.log(`Message published to the topic id: ${pubsubTopicId}`);
  } catch (error) {
    console.error('Error starting transcription:', error);
  }
} // end of the enclosing function (not shown)

async function getRecognizer(languageCode, googleLanguageCode) {
  const recognizerId = `${languageCode}recognizer`;
  const parent = speechClient.locationPath(projectId, 'global');
  const [recognizers] = await speechClient.listRecognizers({ parent });
  const recognizer = recognizers.find(r => r.name.endsWith(recognizerId));
  if (recognizer) {
    return recognizer.name;
  } else {
    const [operation] = await speechClient.createRecognizer({
      recognizer: {
        languageCodes: [googleLanguageCode],
        model: 'long',
        defaultRecognitionConfig: {
          features: {
            enableWordTimeOffsets: true,
            enableAutomaticPunctuation: true,
          },
          autoDecodingConfig: {},
        },
      },
      recognizerId: recognizerId,
      parent: parent,
    });
    const [response] = await operation.promise();
    return response.name;
  }
}
```
The code above submits the audio for transcription; another function then checks the transcription progress. That code is below:
```javascript
// ...inside the polling loop (operationName, done, interval, etc. defined above):
const [operation] = await speechClient.operationsClient.getOperation({ name: operationName });
if (operation.done) {
  done = true;
  if (operation.response) {
    console.log(`Operation ${operationName} is done. Processing results.`);
    const BatchRecognizeResponse = protoRoot.lookupType('google.cloud.speech.v2.BatchRecognizeResponse');
    const response = BatchRecognizeResponse.decode(operation.response.value);
    console.log('here is the response:', response.results[gcsUri]);
    const inlineResult = response.results[gcsUri].inlineResult;
    const transcriptResults = inlineResult.transcript.results;
    const srt = gstt.convertGSTTToSRT(transcriptResults);
    await streamFileUpload(fileName, srt).catch(console.error);
  } else if (operation.error) {
    await streamFileUpload(fileName, '').catch(console.error);
    console.error('Operation finished with an error:', operation.error);
  }
} else {
  const OperationMetadata = protoRoot.lookupType('google.cloud.speech.v2.OperationMetadata');
  const metadata = OperationMetadata.decode(operation.metadata.value);
  console.log('here is the metadata:', metadata);
  const progressPercent = metadata.progressPercent || 0;
  console.log(`Transcription progress: ${progressPercent}%`);
  // Wait, then poll again with exponential backoff capped at 4 seconds.
  await new Promise(resolve => setTimeout(resolve, interval));
  interval = Math.min(interval * 2, 4000);
}
```
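For what it's worth, the polling interval above doubles on every attempt and is capped at 4000 ms, so the polling loop itself cannot add an hour of delay. A pure sketch of that schedule (initialMs stands in for whatever `interval` starts at, which is not shown here):

```javascript
// Pure sketch of the backoff schedule used in the polling loop above:
// double the interval each attempt, capped at capMs.
function backoffSchedule(initialMs, attempts, capMs = 4000) {
  const schedule = [];
  let interval = initialMs;
  for (let i = 0; i < attempts; i++) {
    schedule.push(interval);
    interval = Math.min(interval * 2, capMs); // same doubling-with-cap as above
  }
  return schedule;
}

console.log(backoffSchedule(500, 6)); // [ 500, 1000, 2000, 4000, 4000, 4000 ]
```

At the 4-second cap, an hour of waiting corresponds to roughly 900 polls, so the time is clearly being spent on the operation itself, not in this loop.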
Before August 13, the same audio took around 11 seconds to transcribe; now it takes more than an hour to complete.
The quota looks fine, and network connectivity is essentially unchanged. I have tested with several audio files, all under 10 seconds long except the 46-second one, and all the reports look fine. The transcription does complete, but it takes more than an hour.
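To narrow down where the time actually goes, I am adding a small timing wrapper around each API call (a hypothetical debugging helper, not part of the code above):

```javascript
// Hypothetical helper: log how long an async step takes, so I can tell
// whether getRecognizer, batchRecognize, or the long-running operation
// itself is the slow part.
async function timed(label, fn) {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    console.log(`${label} took ${Date.now() - start} ms`);
  }
}

// Example usage (the wrapped call is from the code above):
// const [operation] = await timed('batchRecognize', () => speechClient.batchRecognize(request));
```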
I hope the above information helps, and I am hoping for a quick response. Thank you.