Generate data insights in BigQuery

Data insights offers an automated way to explore and understand your data. It uses Gemini to generate queries based on the metadata of a table, and helps you uncover patterns, assess data quality, and perform statistical analysis.

This document describes the key features of data insights and the process to automate query generation for insightful data exploration.

Before you begin

Data insights are generated using Gemini in BigQuery. To start generating insights, you must first set up Gemini in BigQuery.

Required roles

To create, manage, and retrieve data insights, ask your administrator to grant you the following IAM roles:

To get read-only access to the generated insights, ask your administrator to grant you the following IAM role:

  • Dataplex DataScan DataViewer (roles/dataplex.dataScanDataViewer) on the project containing the BigQuery tables for which you want to view insights.

For more information about granting roles, see Manage access to service accounts.

You might also be able to get the required permissions through custom roles or other predefined roles. To see the exact permissions that are required generate insights, expand the Required permissions section:

Required permissions

  • bigquery.jobs.create
  • bigquery.tables.get
  • bigquery.tables.getData
  • dataplex.datascans.create
  • dataplex.datascans.get
  • dataplex.datascans.getData
  • dataplex.datascans.run

Enable APIs

To use data insights, enable the following APIs in your project:

For more information about enabling the Gemini for Google Cloud API, see Enable the Gemini for Google Cloud API in a Google Cloud project.

About data insights

When exploring a new, unfamiliar table, data analysts often face the cold-start problem. The problem often involves uncertainties about the data structure, key patterns, and relevant insights in the data, making it hard to get started writing queries. Data insights addresses the cold-start problem by automatically generating queries in natural language and SQL based on the table's metadata. Rather than starting with an empty query editor, you can quickly start data exploration with meaningful queries that offer valuable insights. Queries generated using data insights are grounded using published profile scan data to improve their accuracy and usefulness.

Example of an insights run

Consider a table called telco_churn with the following metadata:

Field name Type
CustomerID STRING
Gender STRING
Tenure INT64
InternetService STRING
StreamingTV STRING
OnlineBackup STRING
Contract STRING
TechSupport STRING
PaymentMethod STRING
MonthlyCharges FLOAT
Churn BOOLEAN

The following are some of the sample queries that data insights generates for this table:

  • Identify customers who have subscribed to all premium services and have been a customer for more than 50 months.

    SELECT
      CustomerID,
      Contract,
      Tenure
    FROM
      agentville_datasets.telco_churn
    WHERE
      OnlineBackup = 'Yes'
      AND TechSupport = 'Yes'
      AND StreamingTV = 'Yes'
      AND Tenure > 50;
    
  • Identify which internet service has the most churned customers.

    SELECT
      InternetService,
      COUNT(DISTINCT CustomerID) AS total_customers
    FROM
      agentville_datasets.telco_churn
    WHERE
      Churn = TRUE
    GROUP BY
      InternetService
    ORDER BY
      total_customers DESC
    LIMIT 1;
    
  • Identify churn rates by segment among high-value customers.

    SELECT
      Contract,
      InternetService,
      Gender,
      PaymentMethod,
      COUNT(DISTINCT CustomerID) AS total_customers,
      SUM(CASE WHEN Churn = TRUE THEN 1 ELSE 0 END) AS churned_customers,
      (SUM(CASE WHEN Churn = TRUE THEN 1 ELSE 0 END) / COUNT(DISTINCT CustomerID))
      * 100 AS churn_rate
    FROM
      agentville_datasets.telco_churn
    WHERE
      MonthlyCharges > 100
    GROUP BY
      Contract,
      InternetService,
      Gender,
      PaymentMethod;
    

About grounding insights using profile scans

When you create a data profile scan for a table, you can choose to publish the scan results to BigQuery and Dataplex Catalog. BigQuery uses the published profile scan data to generate more accurate and relevant queries about the table.

Profile scan data provides information about the data distribution, data types, and statistical summaries of the dataset. Data insights uses profile scan data to create queries that are based on the actual data distribution and patterns within the dataset. This process involves the following steps:

  • Analyzing the profile scan data to identify interesting patterns, trends, or outliers in the data
  • Generating queries that focus on these patterns, trends, or outliers to uncover insights
  • Validating the generated queries against the profile scan data to ensure that the queries return meaningful results

If no scans are available, insights are generated based on the column names and descriptions.

Tips for improving data insights

Grounding queries ensures that the generated insights are based on actual trends and real values in your data. To make the most out of grounded queries using profile scan data, follow these tips:

  • Ensure that your table has up-to-date published profile scan data.
  • Review the generated queries to understand how they are grounded in the profile scan data.
  • Adjust your profile scan settings to increase the sampling size and filter rows and columns.
  • Add descriptions to the table and its columns. More accurate and detailed descriptions can help produce higher quality insights.
  • After you have updated your profile scan or descriptions, regenerate insights for your table.

Generate insights for a BigQuery table

To generate insights for a BigQuery table, you must access the table entry in BigQuery using BigQuery Studio.

  1. In the Google Cloud console, go to BigQuery Studio.

    Go to BigQuery Studio

  2. In the Explorer pane, select the table you want to generate insights for.

  3. Click the Insights tab. If the tab is empty, it means that the insights for this table are not generated yet.

  4. To trigger the insights pipeline, click Generate insights.

    It takes a few minutes for the insights to be populated.

    If a published profile scan for the table is available, it's used to generate comprehensive insights. Otherwise, insights are generated based on the column names and descriptions. This approach ensures that you receive insights regardless of the availability of a profile scan.

  5. In the Insights tab, explore the generated queries and their descriptions.

  6. To open a query in BigQuery, click Copy to Query.

  7. To generate a new set of queries, click Generate insights and trigger the pipeline again.

After you have generated insights for a table, anyone with the dataplex.datascans.getData permission and access to the table can view those insights.

Generate insights for a BigQuery external table

BigQuery data insights supports generating insights for BigQuery external tables with data in Cloud Storage. You and the Dataplex service account for the current project must have the Storage Object Viewer (roles/storage.objectViewer) role on the Cloud Storage bucket that contains the data. For more information, see Add a principal to a bucket-level policy.

To generate insights for a BigQuery external table, follow the instructions described in the Generate insights for a BigQuery table section of this document.

Generate insights for a BigLake table

To generate insights for a BigLake table, follow these steps:

  1. Enable the BigQuery Connection API in your project.

    Enable the BigQuery Connection API

  2. Create a BigQuery connection. For more information, see Manage connections.

  3. Grant the Storage Object Viewer (roles/storage.objectViewer) IAM role to the service account corresponding to the BigQuery connection that you created.

    You can retrieve the service account ID from the connection details.

  4. To generate insights, follow the instructions described in the Generate insights for a BigQuery table section of this document.

Pricing

For details about pricing for this feature, see Gemini in BigQuery pricing overview.

Quotas and limits

For information about quotas and limits for this feature, see Quotas for Gemini in BigQuery.

Limitations

  • Data insights is available for BigQuery tables, BigLake tables, external tables, and views.
  • For multicloud customers, data from other clouds isn't available.
  • Data insights doesn't support Geo or JSON column types.
  • Insights runs don't guarantee the presentation of queries every time. To increase the likelihood of generating more engaging queries, re-initiate the insights pipeline.
  • For tables with column-level access control (ACLs) and restricted user permissions, you can generate insights if you have read access to all columns of the table. To run the generated queries, you must have sufficient permissions.

What's next