Manage BigLake tables for Apache Iceberg in BigQuery

The following sections describe how to create, load, manage, and query managed tables using BigLake tables for Apache Iceberg in BigQuery (hereafter BigLake Iceberg tables in BigQuery).

Before you begin

  • Understand the different types of BigLake tables and the implications of using them in the BigLake table overview.

  • Before creating and using BigLake Iceberg tables in BigQuery, ensure that you have set up a Cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section. For more information about required roles and permissions for connections, see Manage connections.

Required roles

To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to let BigQuery manage tables in your project:

  • bigquery.connections.delegate on your project
  • bigquery.jobs.create on your project
  • bigquery.readsessions.create on your project
  • bigquery.tables.create on your project
  • bigquery.tables.get on your project
  • bigquery.tables.getData on your project
  • storage.buckets.get on your bucket
  • storage.objects.create on your bucket
  • storage.objects.delete on your bucket
  • storage.objects.get on your bucket
  • storage.objects.list on your bucket

You might also be able to get these permissions with custom roles or other predefined roles.

Create BigLake Iceberg tables in BigQuery

To create a BigLake Iceberg table in BigQuery, select one of the following methods:

SQL

CREATE TABLE [PROJECT_ID.]DATASET_ID.TABLE_NAME (
COLUMN DATA_TYPE[, ...]
)
CLUSTER BY CLUSTER_COLUMN_LIST
WITH CONNECTION {CONNECTION_NAME | DEFAULT}
OPTIONS (
file_format = 'PARQUET',
table_format = 'ICEBERG',
storage_uri = 'STORAGE_URI');

Replace the following:

  • PROJECT_ID: the project containing the dataset. If undefined, the command assumes the default project.
  • DATASET_ID: an existing dataset.
  • TABLE_NAME: the name of the table you're creating.
  • DATA_TYPE: the data type of the information that is contained in the column.
  • CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
  • CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.

    To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.

  • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
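For example, the following statement creates a clustered table. The dataset, table, connection, and bucket names are illustrative assumptions, not values from your project:

```sql
CREATE TABLE mydataset.mytable (
  id INT64,
  name STRING,
  created_at TIMESTAMP
)
CLUSTER BY id
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/mytable');
```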

bq

bq --project_id=PROJECT_ID mk \
    --table \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=CONNECTION_NAME \
    --storage_uri=STORAGE_URI \
    --schema=COLUMN_NAME:DATA_TYPE[, ...] \
    --clustering_fields=CLUSTER_COLUMN_LIST \
    DATASET_ID.MANAGED_TABLE_NAME

Replace the following:

  • PROJECT_ID: the project containing the dataset. If undefined, the command assumes the default project.
  • CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
  • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
  • COLUMN_NAME: the column name.
  • DATA_TYPE: the data type of the information contained in the column.
  • CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
  • DATASET_ID: the ID of an existing dataset.
  • MANAGED_TABLE_NAME: the name of the table you're creating.
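For example, the following command creates a clustered table. All names are illustrative assumptions:

```shell
bq --project_id=myproject mk \
    --table \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=myproject.us.myconnection \
    --storage_uri=gs://mybucket/mytable \
    --schema=id:INT64,name:STRING \
    --clustering_fields=id \
    mydataset.mytable
```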

API

Call the tables.insert method with a defined table resource, similar to the following:

{
"tableReference": {
  "tableId": "TABLE_NAME"
},
"biglakeConfiguration": {
  "connectionId": "CONNECTION_NAME",
  "fileFormat": "PARQUET",
  "tableFormat": "ICEBERG",
  "storageUri": "STORAGE_URI"
},
"schema": {
  "fields": [
    {
      "name": "COLUMN_NAME",
      "type": "DATA_TYPE"
    }
    [, ...]
  ]
}
}

Replace the following:

  • TABLE_NAME: the name of the table that you're creating.
  • CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
  • STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
  • COLUMN_NAME: the column name.
  • DATA_TYPE: the data type of the information contained in the column.
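For example, a populated request body might look like the following. The table, connection, bucket, and column names are illustrative assumptions:

```json
{
  "tableReference": {
    "tableId": "mytable"
  },
  "biglakeConfiguration": {
    "connectionId": "myproject.us.myconnection",
    "fileFormat": "PARQUET",
    "tableFormat": "ICEBERG",
    "storageUri": "gs://mybucket/mytable"
  },
  "schema": {
    "fields": [
      { "name": "id", "type": "INTEGER" },
      { "name": "name", "type": "STRING" }
    ]
  }
}
```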

Import data into BigLake Iceberg tables in BigQuery

The following sections describe how to import data from various table formats into BigLake Iceberg tables in BigQuery.

Standard load from flat files

BigLake Iceberg tables in BigQuery use BigQuery load jobs to load external files. If you have an existing BigLake Iceberg table in BigQuery, follow the bq load CLI guide or the LOAD DATA SQL guide to load external data. After loading the data, new Parquet files are written into the STORAGE_URI/data folder.

If you follow the preceding instructions without an existing BigLake Iceberg table in BigQuery, a standard BigQuery table is created instead.

See the following for tool-specific examples of batch loads into managed tables:

SQL

LOAD DATA INTO MANAGED_TABLE_NAME
FROM FILES (
uris=['STORAGE_URI'],
format='FILE_FORMAT');

Replace the following:

  • MANAGED_TABLE_NAME: the name of an existing BigLake Iceberg table in BigQuery.
  • STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
  • FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
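For example, the following statement loads Parquet files into an existing table. The table and bucket names are illustrative assumptions:

```sql
LOAD DATA INTO mydataset.mytable
FROM FILES (
  uris=['gs://mybucket/staging/*.parquet'],
  format='PARQUET');
```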

bq

bq load \
  --source_format=FILE_FORMAT \
  MANAGED_TABLE_NAME \
  STORAGE_URI

Replace the following:

  • FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
  • MANAGED_TABLE_NAME: the name of an existing BigLake Iceberg table in BigQuery.
  • STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
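For example, the following command loads Parquet files into an existing table. The table and bucket names are illustrative assumptions:

```shell
bq load \
  --source_format=PARQUET \
  mydataset.mytable \
  "gs://mybucket/staging/*.parquet"
```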

Standard load from Hive-partitioned files

You can load Hive-partitioned files into BigLake Iceberg tables in BigQuery using standard BigQuery load jobs. For more information, see Loading externally partitioned data.

Load streaming data from Pub/Sub

You can load streaming data into BigLake Iceberg tables in BigQuery by using a Pub/Sub BigQuery subscription.

Export data from BigLake Iceberg tables in BigQuery

The following sections describe how to export data from BigLake Iceberg tables in BigQuery into various table formats.

Export data into flat formats

To export a BigLake Iceberg table in BigQuery into a flat format, use the EXPORT DATA statement and select a destination format. For more information, see Exporting data.
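For example, the following statement exports a table to Parquet files in Cloud Storage. The table name and destination path are illustrative assumptions:

```sql
EXPORT DATA OPTIONS (
  uri = 'gs://mybucket/exports/*.parquet',
  format = 'PARQUET')
AS SELECT * FROM mydataset.mytable;
```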

Create BigLake Iceberg table in BigQuery metadata snapshots

To create a BigLake Iceberg table in BigQuery metadata snapshot, follow these steps:

  1. Export the metadata into the Iceberg V2 format with the EXPORT TABLE METADATA SQL statement.

  2. Optional: Schedule Iceberg metadata snapshot refresh. To refresh an Iceberg metadata snapshot based on a set time interval, use a scheduled query.

  3. Optional: Enable metadata auto-refresh for your project to automatically update your Iceberg table metadata snapshot on each table mutation. To enable metadata auto-refresh, contact bigquery-tables-for-apache-iceberg-help@google.com. EXPORT METADATA costs are applied on each refresh operation.

The following example creates a scheduled query named My Scheduled Snapshot Refresh Query using the DDL statement EXPORT TABLE METADATA FROM mydataset.test. The DDL statement runs every 24 hours.

bq query \
    --use_legacy_sql=false \
    --display_name='My Scheduled Snapshot Refresh Query' \
    --schedule='every 24 hours' \
    'EXPORT TABLE METADATA FROM mydataset.test'

View BigLake Iceberg table in BigQuery metadata snapshot

After you refresh the BigLake Iceberg table in BigQuery metadata snapshot, you can find the snapshot in the Cloud Storage URI that the table was originally created in. The /data folder contains the Parquet data files, and the /metadata folder contains the metadata snapshot. To find a table's storage URI, query the ddl column in INFORMATION_SCHEMA.TABLES:

SELECT
  table_name,
  REGEXP_EXTRACT(ddl, r"storage_uri\s*=\s*\"([^\"]+)\"") AS storage_uri
FROM
  `mydataset`.INFORMATION_SCHEMA.TABLES;

Note that mydataset is a placeholder for your actual dataset; table_name and storage_uri are columns returned by the query.

Read BigLake Iceberg tables in BigQuery with Apache Spark

The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified BigLake Iceberg table in BigQuery.

spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER \
  --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.CATALOG_NAME.type=hadoop \
  --conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH'

# Query the table
SELECT * FROM CATALOG_NAME.FOLDER_NAME;

Replace the following:

  • ICEBERG_VERSION_NUMBER: the current version of Apache Spark Iceberg runtime. Download the latest version from Spark Releases.
  • CATALOG_NAME: the catalog to reference your BigLake Iceberg table in BigQuery.
  • BUCKET_PATH: the path to the bucket containing the table files. For example, gs://mybucket/.
  • FOLDER_NAME: the folder containing the table files. For example, myfolder.
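For example, a filled-in invocation might look like the following. The runtime version, catalog name, bucket, and folder are illustrative assumptions; choose the runtime artifact that matches your Spark and Scala versions:

```shell
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.type=hadoop \
  --conf spark.sql.catalog.my_catalog.warehouse='gs://mybucket/'

# Query the table
SELECT * FROM my_catalog.myfolder;
```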

Modify BigLake Iceberg tables in BigQuery

To modify a BigLake Iceberg table in BigQuery, follow the steps shown in Modifying table schemas.

Use multi-statement transactions

To gain access to multi-statement transactions for BigLake Iceberg tables in BigQuery, fill out the sign-up form.

Use partitioning

To gain access to partitioning for BigLake Iceberg tables in BigQuery, fill out the sign-up form.

You partition a table by specifying a partition column, which is used to segment the table. The following column types are supported for BigLake Iceberg tables in BigQuery:

  • DATE
  • DATETIME
  • TIMESTAMP

Partitioning a table on a DATE, DATETIME, or TIMESTAMP column is known as time-unit column partitioning. You choose whether the partitions have hourly, daily, monthly, or yearly granularity.

BigLake Iceberg tables in BigQuery also support clustering and combining clustered and partitioned tables.
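Assuming partitioning is enabled for your project, a daily-partitioned table could be created with a statement like the following sketch. The names are illustrative assumptions, and the PARTITION BY clause follows standard BigQuery time-unit column partitioning:

```sql
CREATE TABLE mydataset.events (
  event_id INT64,
  event_time TIMESTAMP
)
PARTITION BY DATE(event_time)
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/events');
```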

Partitioning limitations

Create a partitioned BigLake Iceberg table in BigQuery

To create a partitioned BigLake Iceberg table in BigQuery, follow the instructions to create a standard BigLake Iceberg table in BigQuery, and include one of the following, depending on your environment:

Modify and query partitioned BigLake Iceberg tables in BigQuery

BigQuery data manipulation language (DML) statements and queries for partitioned BigLake Iceberg tables in BigQuery are the same as for standard BigLake Iceberg tables in BigQuery. BigQuery automatically scopes the job to the right partitions, similar to Iceberg hidden partitioning. Additionally, any new data that you add to the table is automatically partitioned.

You can also query partitioned BigLake Iceberg tables in BigQuery with other engines in the same way as standard BigLake Iceberg tables in BigQuery. We recommend enabling metadata snapshots for the best experience.

For enhanced security, partitioning information for BigLake Iceberg tables in BigQuery is decoupled from the data path and is managed entirely by the metadata layer.