The following sections describe how to create, load, manage, and query managed tables using BigLake tables for Apache Iceberg in BigQuery (hereafter BigLake Iceberg tables in BigQuery).
Before you begin
Understand the different types of BigLake tables and the implications of using them by reading the BigLake table overview.
Before creating and using BigLake Iceberg tables in BigQuery, ensure that you have set up a Cloud resource connection to a storage bucket. Your connection needs write permissions on the storage bucket, as specified in the following Required roles section. For more information about required roles and permissions for connections, see Manage connections.
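If you haven't created a connection yet, the following sketch shows one way to do it with the bq CLI. The location us and the connection ID myconnection are example values, not requirements:

bq mk --connection \
    --location=us \
    --project_id=PROJECT_ID \
    --connection_type=CLOUD_RESOURCE \
    myconnection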
Required roles
To get the permissions that you need to let BigQuery manage tables in your project, ask your administrator to grant you the following IAM roles:
- To create BigLake Iceberg tables in BigQuery:
  - BigQuery Data Owner (roles/bigquery.dataOwner) on your project
  - BigQuery Connection Admin (roles/bigquery.connectionAdmin) on your project
- To query BigLake Iceberg tables in BigQuery:
  - BigQuery Data Viewer (roles/bigquery.dataViewer) on your project
  - BigQuery User (roles/bigquery.user) on your project
- Grant the connection service account the following roles so it can read and write data in Cloud Storage:
  - Storage Object User (roles/storage.objectUser) on the bucket
  - Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) on the bucket
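For example, you could grant the bucket-level roles to the connection's service account with gcloud; the bucket name and the CONNECTION_SERVICE_ACCOUNT placeholder are values you would substitute:

gcloud storage buckets add-iam-policy-binding gs://mybucket \
    --member=serviceAccount:CONNECTION_SERVICE_ACCOUNT \
    --role=roles/storage.objectUser

gcloud storage buckets add-iam-policy-binding gs://mybucket \
    --member=serviceAccount:CONNECTION_SERVICE_ACCOUNT \
    --role=roles/storage.legacyBucketReader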
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to let BigQuery manage tables in your project. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to let BigQuery manage tables in your project:
-
bigquery.connections.delegateon your project -
bigquery.jobs.createon your project -
bigquery.readsessions.createon your project -
bigquery.tables.createon your project -
bigquery.tables.geton your project -
bigquery.tables.getDataon your project -
storage.buckets.geton your bucket -
storage.objects.createon your bucket -
storage.objects.deleteon your bucket -
storage.objects.geton your bucket -
storage.objects.liston your bucket
You might also be able to get these permissions with custom roles or other predefined roles.
Create BigLake Iceberg tables in BigQuery
To create a BigLake Iceberg table in BigQuery, select one of the following methods:
SQL
CREATE TABLE [PROJECT_ID.]DATASET_ID.TABLE_NAME (
  COLUMN DATA_TYPE[, ...]
)
CLUSTER BY CLUSTER_COLUMN_LIST
WITH CONNECTION {CONNECTION_NAME | DEFAULT}
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'STORAGE_URI');
Replace the following:
- PROJECT_ID: the project containing the dataset. If undefined, the command assumes the default project.
- DATASET_ID: an existing dataset.
- TABLE_NAME: the name of the table you're creating.
- DATA_TYPE: the data type of the information that is contained in the column.
- CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
- CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection. To use a default connection, specify DEFAULT instead of the connection string containing PROJECT_ID.REGION.CONNECTION_ID.
- STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
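For example, the following sketch creates a table with hypothetical names (the mydataset.orders table and its columns are example values; myproject.us.myconnection and mybucket follow the placeholders above):

CREATE TABLE mydataset.orders (
  order_id INT64,
  order_ts TIMESTAMP,
  amount NUMERIC
)
CLUSTER BY order_id
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/orders');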
bq
bq --project_id=PROJECT_ID mk \
    --table \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=CONNECTION_NAME \
    --storage_uri=STORAGE_URI \
    --schema=COLUMN_NAME:DATA_TYPE[, ...] \
    --clustering_fields=CLUSTER_COLUMN_LIST \
    DATASET_ID.MANAGED_TABLE_NAME
Replace the following:
- PROJECT_ID: the project containing the dataset. If undefined, the command assumes the default project.
- CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
- STORAGE_URI: a fully qualified Cloud Storage URI. For example, gs://mybucket/table.
- COLUMN_NAME: the column name.
- DATA_TYPE: the data type of the information contained in the column.
- CLUSTER_COLUMN_LIST (optional): a comma-separated list containing up to four columns. They must be top-level, non-repeated columns.
- DATASET_ID: the ID of an existing dataset.
- MANAGED_TABLE_NAME: the name of the table you're creating.
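The following sketch shows the same hypothetical orders table created through the bq CLI; all names are example values:

bq --project_id=myproject mk \
    --table \
    --file_format=PARQUET \
    --table_format=ICEBERG \
    --connection_id=myproject.us.myconnection \
    --storage_uri=gs://mybucket/orders \
    --schema=order_id:INT64,order_ts:TIMESTAMP \
    mydataset.orders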
API
Call the tables.insert method with a defined table resource, similar to the following:
{ "tableReference": { "tableId": "TABLE_NAME" }, "biglakeConfiguration": { "connectionId": "CONNECTION_NAME", "fileFormat": "PARQUET", "tableFormat": "ICEBERG", "storageUri": "STORAGE_URI" }, "schema": { "fields": [ { "name": "COLUMN_NAME", "type": "DATA_TYPE" } [, ...] ] } }
Replace the following:
- TABLE_NAME: the name of the table that you're creating.
- CONNECTION_NAME: the name of the connection. For example, myproject.us.myconnection.
- STORAGE_URI: a fully qualified Cloud Storage URI. Wildcards are also supported. For example, gs://mybucket/table.
- COLUMN_NAME: the column name.
- DATA_TYPE: the data type of the information contained in the column.
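As a concrete sketch, a request body for the hypothetical orders table might look like the following; all names are example values:

{
  "tableReference": { "tableId": "orders" },
  "biglakeConfiguration": {
    "connectionId": "myproject.us.myconnection",
    "fileFormat": "PARQUET",
    "tableFormat": "ICEBERG",
    "storageUri": "gs://mybucket/orders"
  },
  "schema": {
    "fields": [
      { "name": "order_id", "type": "INT64" },
      { "name": "order_ts", "type": "TIMESTAMP" }
    ]
  }
}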
Import data into BigLake Iceberg tables in BigQuery
The following sections describe how to import data from various table formats into BigLake Iceberg tables in BigQuery.
Standard load from flat files
BigLake Iceberg tables in BigQuery use BigQuery load jobs to load external files. If you have an existing BigLake Iceberg table in BigQuery, follow the bq load CLI guide or the LOAD SQL guide to load external data. After loading the data, new Parquet files are written into the STORAGE_URI/data folder.
If you follow the prior instructions without an existing BigLake Iceberg table in BigQuery, a BigQuery table is created instead.
See the following for tool-specific examples of batch loads into managed tables:
SQL
LOAD DATA INTO MANAGED_TABLE_NAME
FROM FILES (
  uris = ['STORAGE_URI'],
  format = 'FILE_FORMAT');
Replace the following:
- MANAGED_TABLE_NAME: the name of an existing BigLake Iceberg table in BigQuery.
- STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
- FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
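For example, a sketch that loads Parquet files from a hypothetical staging prefix into the hypothetical orders table:

LOAD DATA INTO mydataset.orders
FROM FILES (
  uris = ['gs://mybucket/staging/*.parquet'],
  format = 'PARQUET');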
bq
bq load \
    --source_format=FILE_FORMAT \
    MANAGED_TABLE_NAME \
    STORAGE_URI
Replace the following:
- FILE_FORMAT: the source table format. For supported formats, see the format row of load_option_list.
- MANAGED_TABLE_NAME: the name of an existing BigLake Iceberg table in BigQuery.
- STORAGE_URI: a fully qualified Cloud Storage URI or a comma-separated list of URIs. Wildcards are also supported. For example, gs://mybucket/table.
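The equivalent sketch with the bq CLI, again with hypothetical names:

bq load \
    --source_format=PARQUET \
    mydataset.orders \
    gs://mybucket/staging/*.parquet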
Standard load from Hive-partitioned files
You can load Hive-partitioned files into BigLake Iceberg tables in BigQuery using standard BigQuery load jobs. For more information, see Loading externally partitioned data.
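As a sketch of what such a load can look like in SQL, assuming hypothetical data laid out under gs://mybucket/staging with Hive-style key=value subdirectories; see Loading externally partitioned data for the authoritative syntax:

LOAD DATA INTO mydataset.orders
WITH PARTITION COLUMNS
FROM FILES (
  format = 'PARQUET',
  uris = ['gs://mybucket/staging/*'],
  hive_partition_uri_prefix = 'gs://mybucket/staging');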
Load streaming data from Pub/Sub
You can load streaming data into BigLake Iceberg tables in BigQuery by using a Pub/Sub BigQuery subscription.
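For instance, a BigQuery subscription can be created with gcloud; the topic, subscription, and table names here are hypothetical:

gcloud pubsub subscriptions create my-subscription \
    --topic=my-topic \
    --bigquery-table=myproject:mydataset.orders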
Export data from BigLake Iceberg tables in BigQuery
The following sections describe how to export data from BigLake Iceberg tables in BigQuery into various table formats.
Export data into flat formats
To export a BigLake Iceberg table in BigQuery into a flat format, use the
EXPORT DATA statement
and select a destination format. For more information, see
Exporting data.
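A minimal sketch, assuming the hypothetical orders table and an export prefix in mybucket (the wildcard in the URI lets the export be written as multiple shards):

EXPORT DATA OPTIONS (
  uri = 'gs://mybucket/exports/orders-*.parquet',
  format = 'PARQUET')
AS SELECT * FROM mydataset.orders;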
Create metadata snapshots for BigLake Iceberg tables in BigQuery
To create a metadata snapshot for a BigLake Iceberg table in BigQuery, follow these steps:
1. Export the metadata into the Iceberg V2 format with the EXPORT TABLE METADATA SQL statement.
2. Optional: Schedule Iceberg metadata snapshot refreshes. To refresh an Iceberg metadata snapshot on a set time interval, use a scheduled query.
3. Optional: Enable metadata auto-refresh for your project to automatically update your Iceberg table metadata snapshot on each table mutation. To enable metadata auto-refresh, contact bigquery-tables-for-apache-iceberg-help@google.com.
EXPORT METADATA costs are applied on each refresh operation.
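For a one-time export, you can run the statement directly on a table, such as the mydataset.test table used in the scheduling example that follows:

EXPORT TABLE METADATA FROM mydataset.test;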
The following example creates a scheduled query named My Scheduled Snapshot Refresh Query
using the DDL statement EXPORT TABLE METADATA FROM mydataset.test. The DDL
statement runs every 24 hours.
bq query \
    --use_legacy_sql=false \
    --display_name='My Scheduled Snapshot Refresh Query' \
    --schedule='every 24 hours' \
    'EXPORT TABLE METADATA FROM mydataset.test'
View metadata snapshots for BigLake Iceberg tables in BigQuery
After you refresh the metadata snapshot for a BigLake Iceberg table in BigQuery, you can find the snapshot in the Cloud Storage URI that the table was originally created in. The /data folder contains the Parquet data file shards, and the /metadata folder contains the metadata snapshot. To find the storage URI for each table, you can extract it from the table DDL in INFORMATION_SCHEMA:
SELECT
  table_name,
  REGEXP_EXTRACT(ddl, r"storage_uri\s*=\s*\"([^\"]+)\"") AS storage_uri
FROM `mydataset`.INFORMATION_SCHEMA.TABLES;
Note that mydataset is a placeholder for your dataset; the query returns the name and storage URI of each table in it.
Read BigLake Iceberg tables in BigQuery with Apache Spark
The following sample sets up your environment to use Spark SQL with Apache Iceberg, and then executes a query to fetch data from a specified BigLake Iceberg table in BigQuery.
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-ICEBERG_VERSION_NUMBER \
  --conf spark.sql.catalog.CATALOG_NAME=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.CATALOG_NAME.type=hadoop \
  --conf spark.sql.catalog.CATALOG_NAME.warehouse='BUCKET_PATH'

# Query the table
SELECT * FROM CATALOG_NAME.FOLDER_NAME;
Replace the following:
- ICEBERG_VERSION_NUMBER: the current version of the Apache Spark Iceberg runtime. Download the latest version from Spark Releases.
- CATALOG_NAME: the catalog to reference your BigLake Iceberg table in BigQuery.
- BUCKET_PATH: the path to the bucket containing the table files. For example, gs://mybucket/.
- FOLDER_NAME: the folder containing the table files. For example, myfolder.
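A filled-in sketch might look like the following; the runtime coordinates (Spark 3.5, Scala 2.12, Iceberg 1.5.2) and the catalog name blms are assumptions, so check Spark Releases for the version that matches your cluster:

spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2 \
  --conf spark.sql.catalog.blms=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.blms.type=hadoop \
  --conf spark.sql.catalog.blms.warehouse='gs://mybucket/'

# Then, at the spark-sql prompt:
SELECT * FROM blms.myfolder;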
Modify BigLake Iceberg tables in BigQuery
To modify a BigLake Iceberg table in BigQuery, follow the steps shown in Modifying table schemas.
Use multi-statement transactions
To gain access to multi-statement transactions for BigLake Iceberg tables in BigQuery, fill out the sign-up form.
Use partitioning
To gain access to partitioning for BigLake Iceberg tables in BigQuery, fill out the sign-up form.
You partition a table by specifying a partition column, which is used to segment the table. The following column types are supported for BigLake Iceberg tables in BigQuery:
- DATE
- DATETIME
- TIMESTAMP
Partitioning a table on a DATE, DATETIME, or TIMESTAMP column is known as
time-unit column partitioning.
You choose whether the partitions have
hourly, daily, monthly, or yearly granularity.
BigLake Iceberg tables in BigQuery also support clustering and combining clustered and partitioned tables.
Partitioning limitations
- All BigQuery partitioned table limitations apply.
- Partitioning column types other than DATE, DATETIME, or TIMESTAMP aren't supported.
- Partition expiration isn't supported.
- Partition evolution isn't supported.
Create a partitioned BigLake Iceberg table in BigQuery
To create a partitioned BigLake Iceberg table in BigQuery, follow the instructions to create a standard BigLake Iceberg table in BigQuery and include one of the following, depending on your environment (a SQL sketch follows this list):
- The PARTITION BY clause
- The --time_partitioning_field and --time_partitioning_type flags
- The timePartitioning property
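For instance, a daily-partitioned variant of the hypothetical orders table in SQL; all names are example values:

CREATE TABLE mydataset.orders (
  order_id INT64,
  order_ts TIMESTAMP
)
PARTITION BY DATE(order_ts)
WITH CONNECTION `myproject.us.myconnection`
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://mybucket/orders');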
Modify and query partitioned BigLake Iceberg tables in BigQuery
BigQuery data manipulation language (DML) statements and queries for partitioned BigLake Iceberg tables in BigQuery are the same as for standard BigLake Iceberg tables in BigQuery. BigQuery automatically scopes the job to the right partitions, similar to Iceberg hidden partitioning. Additionally, any new data that you add to the table is automatically partitioned.
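For example, a filter on the partitioning column of the hypothetical partitioned table lets BigQuery prune partitions that can't match:

SELECT order_id
FROM mydataset.orders
WHERE DATE(order_ts) = '2025-06-01';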
You can also query partitioned BigLake Iceberg tables in BigQuery with other engines in the same way as standard BigLake Iceberg tables in BigQuery. We recommend enabling metadata snapshots for the best experience.
For enhanced security, partitioning information for BigLake Iceberg tables in BigQuery is decoupled from the data path and is managed entirely by the metadata layer.