Saturday, March 25, 2023
HomeBig DataUse Apache Iceberg in an information lake to help incremental knowledge processing

Use Apache Iceberg in an information lake to help incremental knowledge processing

Apache Iceberg is an open desk format for very massive analytic datasets, which captures metadata info on the state of datasets as they evolve and alter over time. It provides tables to compute engines together with Spark, Trino, PrestoDB, Flink, and Hive utilizing a high-performance desk format that works similar to a SQL desk. Iceberg has develop into very fashionable for its help for ACID transactions in knowledge lakes and options like schema and partition evolution, time journey, and rollback.

Apache Iceberg integration is supported by AWS analytics companies together with Amazon EMR, Amazon Athena, and AWS Glue. Amazon EMR can provision clusters with Spark, Hive, Trino, and Flink that may run Iceberg. Beginning with Amazon EMR model 6.5.0, you may use Iceberg along with your EMR cluster with out requiring a bootstrap motion. In early 2022, AWS introduced basic availability of Athena ACID transactions, powered by Apache Iceberg. The just lately launched Athena question engine model 3 offers higher integration with the Iceberg desk format. AWS Glue 3.0 and later helps the Apache Iceberg framework for knowledge lakes.

On this submit, we focus on what clients need in fashionable knowledge lakes and the way Apache Iceberg helps handle buyer wants. Then we stroll by means of an answer to construct a high-performance and evolving Iceberg knowledge lake on Amazon Easy Storage Service (Amazon S3) and course of incremental knowledge by working insert, replace, and delete SQL statements. Lastly, we present you how you can efficiency tune the method to enhance learn and write efficiency.

How Apache Iceberg addresses what clients need in fashionable knowledge lakes

Increasingly clients are constructing knowledge lakes, with structured and unstructured knowledge, to help many customers, functions, and analytics instruments. There may be an elevated want for knowledge lakes to help database like options comparable to ACID transactions, record-level updates and deletes, time journey, and rollback. Apache Iceberg is designed to help these options on cost-effective petabyte-scale knowledge lakes on Amazon S3.

Apache Iceberg addresses buyer wants by capturing wealthy metadata details about the dataset on the time the person knowledge recordsdata are created. There are three layers within the structure of an Iceberg desk: the Iceberg catalog, the metadata layer, and the information layer, as depicted within the following determine (supply).

The Iceberg catalog shops the metadata pointer to the present desk metadata file. When a choose question is studying an Iceberg desk, the question engine first goes to the Iceberg catalog, then retrieves the placement of the present metadata file. Every time there may be an replace to the Iceberg desk, a brand new snapshot of the desk is created, and the metadata pointer factors to the present desk metadata file.

The next is an instance Iceberg catalog with AWS Glue implementation. You may see the database title, the placement (S3 path) of the Iceberg desk, and the metadata location.

The metadata layer has three kinds of recordsdata: the metadata file, manifest listing, and manifest file in a hierarchy. On the prime of the hierarchy is the metadata file, which shops details about the desk’s schema, partition info, and snapshots. The snapshot factors to the manifest listing. The manifest listing has the details about every manifest file that makes up the snapshot, comparable to location of the manifest file, the partitions it belongs to, and the decrease and higher bounds for partition columns for the information recordsdata it tracks. The manifest file tracks knowledge recordsdata in addition to further particulars about every file, such because the file format. All three recordsdata work in a hierarchy to trace the snapshots, schema, partitioning, properties, and knowledge recordsdata in an Iceberg desk.

The information layer has the person knowledge recordsdata of the Iceberg desk. Iceberg helps a variety of file codecs together with Parquet, ORC, and Avro. As a result of the Iceberg desk tracks the person knowledge recordsdata as a substitute of solely pointing to the partition location with knowledge recordsdata, it isolates the writing operations from studying operations. You may write the information recordsdata at any time, however solely commit the change explicitly, which creates a brand new model of the snapshot and metadata recordsdata.

Resolution overview

On this submit, we stroll you thru an answer to construct a high-performing Apache Iceberg knowledge lake on Amazon S3; course of incremental knowledge with insert, replace, and delete SQL statements; and tune the Iceberg desk to enhance learn and write efficiency. The next diagram illustrates the answer structure.

To exhibit this answer, we use the Amazon Buyer Opinions dataset in an S3 bucket (s3://amazon-reviews-pds/parquet/). In actual use case, it could be uncooked knowledge saved in your S3 bucket. We are able to examine the information measurement with the next code within the AWS Command Line Interface (AWS CLI):

//Run this AWS CLI command to examine the information measurement
aws s3 ls --summarize --human-readable --recursive s3://amazon-reviews-pds/parquet

The overall object depend is 430, and whole measurement is 47.4 GiB.

To arrange and check this answer, we full the next high-level steps:

  1. Arrange an S3 bucket within the curated zone to retailer transformed knowledge in Iceberg desk format.
  2. Launch an EMR cluster with applicable configurations for Apache Iceberg.
  3. Create a pocket book in EMR Studio.
  4. Configure the Spark session for Apache Iceberg.
  5. Convert knowledge to Iceberg desk format and transfer knowledge to the curated zone.
  6. Run insert, replace, and delete queries in Athena to course of incremental knowledge.
  7. Perform efficiency tuning.


To comply with together with this walkthrough, it’s essential to have an AWS account with an AWS Id and Entry Administration (IAM) position that has enough entry to provision the required assets.

Arrange the S3 bucket for Iceberg knowledge within the curated zone in your knowledge lake

Select the Area wherein you wish to create the S3 bucket and supply a singular title:


Launch an EMR cluster to run Iceberg jobs utilizing Spark

You may create an EMR cluster from the AWS Administration Console, Amazon EMR CLI, or AWS Cloud Improvement Equipment (AWS CDK). For this submit, we stroll you thru how you can create an EMR cluster from the console.

  1. On the Amazon EMR console, select Create cluster.
  2. Select Superior choices.
  3. For Software program Configuration, select the most recent Amazon EMR launch. As of January 2023, the most recent launch is 6.9.0. Iceberg requires launch 6.5.0 and above.
  4. Choose JupyterEnterpriseGateway and Spark because the software program to put in.
  5. For Edit software program settings, choose Enter configuration and enter [{"classification":"iceberg-defaults","properties":{"iceberg.enabled":true}}].
  6. Go away different settings at their default and select Subsequent.
  7. For {Hardware}, use the default setting.
  8. Select Subsequent.
  9. For Cluster title, enter a reputation. We use iceberg-blog-cluster.
  10. Go away the remaining settings unchanged and select Subsequent.
  11. Select Create cluster.

Create a pocket book in EMR Studio

We now stroll you thru how you can create a pocket book in EMR Studio from the console.

  1. On the IAM console, create an EMR Studio service position.
  2. On the Amazon EMR console, select EMR Studio.
  3. Select Get began.

The Get began web page seems in a brand new tab.

  1. Select Create Studio within the new tab.
  2. Enter a reputation. We use iceberg-studio.
  3. Select the identical VPC and subnet as these for the EMR cluster, and the default safety group.
  4. Select AWS Id and Entry Administration (IAM) for authentication, and select the EMR Studio service position you simply created.
  5. Select an S3 path for Workspaces backup.
  6. Select Create Studio.
  7. After the Studio is created, select the Studio entry URL.
  8. On the EMR Studio dashboard, select Create workspace.
  9. Enter a reputation on your Workspace. We use iceberg-workspace.
  10. Broaden Superior configuration and select Connect Workspace to an EMR cluster.
  11. Select the EMR cluster you created earlier.
  12. Select Create Workspace.
  13. Select the Workspace title to open a brand new tab.

Within the navigation pane, there’s a pocket book that has the identical title because the Workspace. In our case, it’s iceberg-workspace.

  1. Open the pocket book.
  2. When prompted to decide on a kernel, select Spark.

Configure a Spark session for Apache Iceberg

Use the next code, offering your individual S3 bucket title:

%%configure -f
"conf": {
"spark.sql.catalog.demo": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.demo.catalog-impl": "",
"spark.sql.catalog.demo.warehouse": "s3://iceberg-curated-blog-data",

This units the next Spark session configurations:

  • spark.sql.catalog.demo – Registers a Spark catalog named demo, which makes use of the Iceberg Spark catalog plugin.
  • spark.sql.catalog.demo.catalog-impl – The demo Spark catalog makes use of AWS Glue because the bodily catalog to retailer Iceberg database and desk info.
  • spark.sql.catalog.demo.warehouse – The demo Spark catalog shops all Iceberg metadata and knowledge recordsdata below the basis path outlined by this property: s3://iceberg-curated-blog-data.
  • spark.sql.extensions – Provides help to Iceberg Spark SQL extensions, which lets you run Iceberg Spark procedures and a few Iceberg-only SQL instructions (you utilize this in a later step).
  • – Iceberg permits customers to write down knowledge to Amazon S3 by means of S3FileIO. The AWS Glue Information Catalog by default makes use of this FileIO, and different catalogs can load this FileIO utilizing the io-impl catalog property.

Convert knowledge to Iceberg desk format

You should utilize both Spark on Amazon EMR or Athena to load the Iceberg desk. Within the EMR Studio Workspace pocket book Spark session, run the next instructions to load the information:

// create a database in AWS Glue named opinions if not exist
spark.sql("CREATE DATABASE IF NOT EXISTS demo.opinions")

// load opinions - this load all of the parquet recordsdata
val reviews_all_location = "s3://amazon-reviews-pds/parquet/"
val reviews_all = spark.learn.parquet(reviews_all_location)

// write opinions knowledge to an Iceberg v2 desk
reviews_all.writeTo("demo.opinions.all_reviews").tableProperty("format-version", "2").createOrReplace()

After you run the code, you need to discover two prefixes created in your knowledge warehouse S3 path (s3://iceberg-curated-blog-data/opinions.db/all_reviews): knowledge and metadata.

Course of incremental knowledge utilizing insert, replace, and delete SQL statements in Athena

Athena is a serverless question engine that you should use to carry out learn, write, replace, and optimization duties in opposition to Iceberg tables. To exhibit how the Apache Iceberg knowledge lake format helps incremental knowledge ingestion, we run insert, replace, and delete SQL statements on the information lake.

Navigate to the Athena console and select Question editor. If that is your first time utilizing the Athena question editor, it is advisable configure the question outcome location to be the S3 bucket you created earlier. It is best to be capable of see that the desk opinions.all_reviews is accessible for querying. Run the next question to confirm that you’ve loaded the Iceberg desk efficiently:

choose * from opinions.all_reviews restrict 5;

Course of incremental knowledge by working insert, replace, and delete SQL statements:

//Instance replace assertion
replace opinions.all_reviews set star_rating=5 the place product_category = 'Watches' and star_rating=4

//Instance delete assertion
delete from opinions.all_reviews the place product_category = 'Watches' and star_rating=1

Efficiency tuning

On this part, we stroll by means of alternative ways to enhance Apache Iceberg learn and write efficiency.

Configure Apache Iceberg desk properties

Apache Iceberg is a desk format, and it helps desk properties to configure desk conduct comparable to learn, write, and catalog. You may enhance the learn and write efficiency on Iceberg tables by adjusting the desk properties.

For instance, if you happen to discover that you simply write too many small recordsdata for an Iceberg desk, you may config the write file measurement to write down fewer however larger measurement recordsdata, to assist enhance question efficiency.

Property Default Description 536870912 (512 MB) Controls the dimensions of recordsdata generated to focus on about this many bytes

Use the next code to change the desk format:

//Instance code to change desk format in EMR Studio Workspace pocket book
spark.sql("ALTER TABLE demo.opinions.all_reviews 
SET TBLPROPERTIES ('write_target_data_file_size_bytes'='536870912')")

Partitioning and sorting

To make a question run quick, the much less knowledge learn the higher. Iceberg takes benefit of the wealthy metadata it captures at write time and facilitates strategies comparable to scan planning, partitioning, pruning, and column-level stats comparable to min/max values to skip knowledge recordsdata that don’t have match information. We stroll you thru how question scan planning and partitioning work in Iceberg and the way we use them to enhance question efficiency.

Question scan planning

For a given question, step one in a question engine is scan planning, which is the method to search out the recordsdata in a desk wanted for a question. Planning in an Iceberg desk may be very environment friendly, as a result of Iceberg’s wealthy metadata can be utilized to prune metadata recordsdata that aren’t wanted, along with filtering knowledge recordsdata that don’t comprise matching knowledge. In our exams, we noticed Athena scanned 50% or much less knowledge for a given question on an Iceberg desk in comparison with authentic knowledge earlier than conversion to Iceberg format.

There are two kinds of filtering:

  • Metadata filtering – Iceberg makes use of two ranges of metadata to trace the recordsdata in a snapshot: the manifest listing and manifest recordsdata. It first makes use of the manifest listing, which acts as an index of the manifest recordsdata. Throughout planning, Iceberg filters manifests utilizing the partition worth vary within the manifest listing with out studying all of the manifest recordsdata. Then it makes use of chosen manifest recordsdata to get knowledge recordsdata.
  • Information filtering – After choosing the listing of manifest recordsdata, Iceberg makes use of the partition knowledge and column-level stats for every knowledge file saved in manifest recordsdata to filter knowledge recordsdata. Throughout planning, question predicates are transformed to predicates on the partition knowledge and utilized first to filter knowledge recordsdata. Then, the column stats like column-level worth counts, null counts, decrease bounds, and higher bounds are used to filter out knowledge recordsdata that may’t match the question predicate. By utilizing higher and decrease bounds to filter knowledge recordsdata at planning time, Iceberg drastically improves question efficiency.

Partitioning and sorting

Partitioning is a option to group information with the identical key column values collectively in writing. The good thing about partitioning is quicker queries that entry solely a part of the information, as defined earlier in question scan planning: knowledge filtering. Iceberg makes partitioning easy by supporting hidden partitioning, in the way in which that Iceberg produces partition values by taking a column worth and optionally remodeling it.

In our use case, we first run the next question on the Iceberg desk not partitioned. Then we partition the Iceberg desk by the class of the opinions, which can be used within the question WHERE situation to filter out information. With partitioning, the question might scan a lot much less knowledge. See the next code:

//Instance code in EMR Studio Workspace pocket book to create an Iceberg desk all_reviews_partitioned partitioned by product_category
reviews_all.writeTo("demo.opinions.all_reviews_partitioned").tableProperty("format-version", "2").partitionedBy($"product_category").createOrReplace()

Run the next choose assertion on the non-partitioned all_reviews desk vs. the partitioned desk to see the efficiency distinction:

//Run this question on all_reviews desk and the partitioned desk for efficiency testing
choose market,customer_id, review_id,product_id,product_title,star_rating from opinions.all_reviews the place product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

//Run the identical choose question on partitioned dataset
choose market,customer_id, review_id,product_id,product_title,star_rating from opinions.all_reviews_partitioned the place product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The next desk exhibits the efficiency enchancment of information partitioning, with about 50% efficiency enchancment and 70% much less knowledge scanned.

Dataset Identify Non-Partitioned Dataset Partitioned Dataset
Runtime (seconds) 8.20 4.25
Information Scanned (MB) 131.55 33.79

Observe that the runtime is the common runtime with a number of runs in our check.

We noticed good efficiency enchancment after partitioning. Nonetheless, this may be additional improved by utilizing column-level stats from Iceberg manifest recordsdata. As a way to use the column-level stats successfully, you wish to additional type your information based mostly on the question patterns. Sorting the entire dataset utilizing the columns which might be typically utilized in queries will reorder the information in such a method that every knowledge file finally ends up with a singular vary of values for the precise columns. If these columns are used within the question situation, it permits question engines to additional skip knowledge recordsdata, thereby enabling even sooner queries.

Copy-on-write vs. read-on-merge

When implementing replace and delete on Iceberg tables within the knowledge lake, there are two approaches outlined by the Iceberg desk properties:

  • Copy-on-write – With this method, when there are adjustments to the Iceberg desk, both updates or deletes, the information recordsdata related to the impacted information can be duplicated and up to date. The information can be both up to date or deleted from the duplicated knowledge recordsdata. A brand new snapshot of the Iceberg desk can be created and pointing to the newer model of information recordsdata. This makes the general writes slower. There is likely to be conditions that concurrent writes are wanted with conflicts so retry has to occur, which will increase the write time much more. However, when studying the information, there is no such thing as a additional course of wanted. The question will retrieve knowledge from the most recent model of information recordsdata.
  • Merge-on-read – With this method, when there are updates or deletes on the Iceberg desk, the prevailing knowledge recordsdata won’t be rewritten; as a substitute new delete recordsdata can be created to trace the adjustments. For deletes, a brand new delete file can be created with the deleted information. When studying the Iceberg desk, the delete file can be utilized to the retrieved knowledge to filter out the delete information. For updates, a brand new delete file can be created to mark the up to date information as deleted. Then a brand new file can be created for these information however with up to date values. When studying the Iceberg desk, each the delete and new recordsdata can be utilized to the retrieved knowledge to mirror the most recent adjustments and produce the right outcomes. So, for any subsequent queries, an additional step to merge the information recordsdata with the delete and new recordsdata will occur, which can often enhance the question time. However, the writes is likely to be sooner as a result of there is no such thing as a must rewrite the prevailing knowledge recordsdata.

To check the affect of the 2 approaches, you may run the next code to set the Iceberg desk properties:

//Run code to change Iceberg desk property to set copy-on-write and merge-on-read in EMR Studio Workspace pocket book
spark.sql(“ALTER TABLE demo.opinions.all_reviews 
SET TBLPROPERTIES (‘write.delete.mode’=’copy-on-write’,’write.replace.mode’=’copy-on-write’)”)

Run the replace, delete, and choose SQL statements in Athena to indicate the runtime distinction for copy-on-write vs. merge-on-read:

//Instance replace assertion
replace opinions.all_reviews set star_rating=5 the place product_category = ‘Watches’ and star_rating=4

//Instance delete assertion
delete from opinions.all_reviews the place product_category = ‘Watches’ and star_rating=1

//Instance choose assertion
choose market,customer_id, review_id,product_id,product_title,star_rating from opinions.all_reviews the place product_category = ‘Watches’ and review_date between date(‘2005-01-01’) and date(‘2005-03-31’)

The next desk summarizes the question runtimes.

Question Copy-on-Write Merge-on-Learn
Runtime (seconds) 66.251 116.174 97.75 10.788 54.941 113.44
Information scanned (MB) 494.06 3.07 137.16 494.06 3.07 137.16

Observe that the runtime is the common runtime with a number of runs in our check.

As our check outcomes present, there are all the time trade-offs within the two approaches. Which method to make use of is determined by your use instances. In abstract, the issues come all the way down to latency on the learn vs. write. You may reference the next desk and make the proper alternative.

. Copy-on-Write Merge-on-Learn
Execs Quicker reads Quicker writes
Cons Costly writes Increased latency on reads
When to make use of Good for frequent reads, rare updates and deletes or massive batch updates Good for tables with frequent updates and deletes

Information compaction

In case your knowledge file measurement is small, you would possibly find yourself with 1000’s or tens of millions of recordsdata in an Iceberg desk. This dramatically will increase the I/O operation and slows down the queries. Moreover, Iceberg tracks every knowledge file in a dataset. Extra knowledge recordsdata result in extra metadata. This in flip will increase the overhead and I/O operation on studying metadata recordsdata. As a way to enhance the question efficiency, it’s advisable to compact small knowledge recordsdata to bigger knowledge recordsdata.

When updating and deleting information in Iceberg desk, if the read-on-merge method is used, you would possibly find yourself with many small deletes or new knowledge recordsdata. Operating compaction will mix all these recordsdata and create a more recent model of the information file. This eliminates the necessity to reconcile them throughout reads. It’s advisable to have common compaction jobs to affect reads as little as attainable whereas nonetheless sustaining sooner write pace.

Run the next knowledge compaction command, then run the choose question from Athena:

//Information compaction 
optimize opinions.all_reviews REWRITE DATA USING BIN_PACK

//Run this question earlier than and after knowledge compaction
choose market,customer_id, review_id,product_id,product_title,star_rating from opinions.all_reviews the place product_category = 'Watches' and review_date between date('2005-01-01') and date('2005-03-31')

The next desk compares the runtime earlier than vs. after knowledge compaction. You may see about 40% efficiency enchancment.

Question Earlier than Information Compaction After Information Compaction
Runtime (seconds) 97.75 32.676 seconds
Information scanned (MB) 137.16 M 189.19 M

Observe that the choose queries ran on the all_reviews desk after replace and delete operations, earlier than and after knowledge compaction. The runtime is the common runtime with a number of runs in our check.

Clear up

After you comply with the answer walkthrough to carry out the use instances, full the next steps to scrub up your assets and keep away from additional prices:

  1. Drop the AWS Glue tables and database from Athena or run the next code in your pocket book:
// DROP the desk 
spark.sql("DROP TABLE demo.opinions.all_reviews") 
spark.sql("DROP TABLE demo.opinions.all_reviews_partitioned") 

// DROP the database 
spark.sql("DROP DATABASE demo.opinions")

  1. On the EMR Studio console, select Workspaces within the navigation pane.
  2. Choose the Workspace you created and select Delete.
  3. On the EMR console, navigate to the Studios web page.
  4. Choose the Studio you created and select Delete.
  5. On the EMR console, select Clusters within the navigation pane.
  6. Choose the cluster and select Terminate.
  7. Delete the S3 bucket and another assets that you simply created as a part of the conditions for this submit.


On this submit, we launched the Apache Iceberg framework and the way it helps resolve a few of the challenges now we have in a contemporary knowledge lake. Then we walked you although an answer to course of incremental knowledge in an information lake utilizing Apache Iceberg. Lastly, we had a deep dive into efficiency tuning to enhance learn and write efficiency for our use instances.

We hope this submit offers some helpful info so that you can determine whether or not you wish to undertake Apache Iceberg in your knowledge lake answer.

In regards to the Authors

Flora Wu is a Sr. Resident Architect at AWS Information Lab. She helps enterprise clients create knowledge analytics methods and construct options to speed up their companies outcomes. In her spare time, she enjoys taking part in tennis, dancing salsa, and touring.

Daniel Li is a Sr. Options Architect at Amazon Internet Companies. He focuses on serving to clients develop, undertake, and implement cloud companies and technique. When not working, he likes spending time outdoor along with his household.



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments