Optimization Methods for Iceberg Tables

Posted in Technical |
February 14, 2024 | 9 min read

Introduction

Apache Iceberg has recently grown in popularity because it adds data warehouse-like capabilities to your data lake, making it easier to analyze all your data, structured and unstructured. It offers several benefits such as schema evolution, hidden partitioning, time travel, and more that improve the productivity of data engineers and data analysts. However, you need to regularly maintain Iceberg tables to keep them in a healthy state so that read queries can perform faster. This blog discusses a few problems that you might encounter with Iceberg tables and offers strategies on how to optimize them in each of those scenarios. You can take advantage of a combination of the strategies provided and adapt them to your particular use cases.

Problem with too many snapshots

Every time a write operation occurs on an Iceberg table, a new snapshot is created. Over time this can cause the table’s metadata.json file to become bloated and the number of old and potentially unnecessary data/delete files in the data store to grow, increasing storage costs. A bloated metadata.json file can increase both read and write times because a large metadata file needs to be read or written every time. Regularly expiring snapshots is recommended to delete data files that are no longer needed and to keep the size of table metadata small. Expiring snapshots is a relatively cheap operation and uses metadata to determine newly unreachable files.

Solution: expire snapshots

We can expire old snapshots using the expire_snapshots procedure.
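As a minimal sketch of what such a call can look like, the Python helper below only builds the Spark SQL text for the expire_snapshots procedure. The catalog name my_catalog, the table db.sample, the cutoff date, and the retain_last value are placeholders chosen for illustration, and the statement would still need to be run with spark.sql(...) in a session where an Iceberg catalog is configured.

```python
from datetime import datetime


def expire_snapshots_sql(catalog: str, table: str,
                         older_than: datetime, retain_last: int = 1) -> str:
    """Build a CALL statement for Iceberg's expire_snapshots Spark procedure."""
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.000")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical catalog/table names; pass the string to spark.sql(stmt).
stmt = expire_snapshots_sql("my_catalog", "db.sample",
                            datetime(2024, 1, 1), retain_last=10)
print(stmt)
```

Keeping a minimum number of recent snapshots with retain_last is one way to make sure time travel to recent versions still works after the cleanup.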

Problem with suboptimal manifests

Over time the snapshots might reference many manifest files. This could cause a slowdown in query planning and increase the runtime of metadata queries. Furthermore, when first created the manifests may not lend themselves well to partition pruning, which increases the overall runtime of the query. On the other hand, if the manifests are well organized into discrete bounds of partitions, then partition pruning can prune away entire subtrees of data files.

Solution: rewrite manifests

We can solve the problem of too many manifest files with rewrite_manifests and potentially get a well-balanced hierarchical tree of data files.
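A sketch of the corresponding procedure call, again built as a plain string in Python. The catalog and table names are hypothetical, and the optional use_caching argument is shown only as an example of a procedure argument you may or may not need.

```python
from typing import Optional


def rewrite_manifests_sql(catalog: str, table: str,
                          use_caching: Optional[bool] = None) -> str:
    """Build a CALL statement for Iceberg's rewrite_manifests Spark procedure."""
    args = [f"table => '{table}'"]
    if use_caching is not None:
        # Spark SQL boolean literal: true / false
        args.append(f"use_caching => {str(use_caching).lower()}")
    return f"CALL {catalog}.system.rewrite_manifests({', '.join(args)})"


# Hypothetical names; run the result with spark.sql(...).
manifests_stmt = rewrite_manifests_sql("my_catalog", "db.sample")
print(manifests_stmt)
```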

Problem with delete files

Background

merge-on-read vs copy-on-write

Since Iceberg V2, whenever existing data needs to be updated (via delete, update, or merge statements), there are two options available: copy-on-write and merge-on-read. With the copy-on-write option, the data files affected by a delete, update, or merge operation are read and entirely new data files are written with the necessary changes. Iceberg doesn’t delete the old data files. So if you want to query the table as it was before the changes were applied, you can use the time travel feature of Iceberg. In a later blog, we will go into details about how to take advantage of the time travel feature. If you decide that the old data files are not needed any more, then you can get rid of them by expiring the older snapshot as discussed above.

With the merge-on-read option, instead of rewriting entire data files at write time, only a delete file is written. This can be an equality delete file or a positional delete file. As of this writing, Spark doesn’t write equality deletes, but it is capable of reading them. The advantage of using this option is that your writes can be much quicker, as you are not rewriting an entire data file. Suppose you want to delete a specific user’s data in a table because of GDPR requirements. Iceberg will simply write a delete file specifying the locations of the user’s records in the corresponding data files. Whenever you read the table, Iceberg will dynamically apply those deletes and present a logical table where the user’s data is deleted, even though the corresponding records are still present in the physical data files.

We enable the merge-on-read option for our customers by default. You can enable or disable it per operation by setting the table properties “write.delete.mode”, “write.update.mode”, and “write.merge.mode” based on your requirements. See Write properties.
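As an illustration, the sketch below generates an ALTER TABLE statement that sets the three Iceberg write-mode properties (write.delete.mode, write.update.mode, write.merge.mode) to the same value. The table name is a placeholder, and whether you want the same mode for all three operations is an assumption made here for brevity.

```python
VALID_MODES = ("copy-on-write", "merge-on-read")


def set_write_modes_sql(table: str, mode: str) -> str:
    """Build an ALTER TABLE statement that sets delete/update/merge write modes."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}")
    props = {f"write.{op}.mode": mode for op in ("delete", "update", "merge")}
    body = ", ".join(f"'{k}' = '{v}'" for k, v in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({body})"


# Hypothetical table name; run the result with spark.sql(...).
modes_stmt = set_write_modes_sql("my_catalog.db.sample", "merge-on-read")
print(modes_stmt)
```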

Serializable vs snapshot isolation

The default isolation guarantee provided for the delete, update, and merge operations is serializable isolation. You can also change the isolation level to snapshot isolation. Both serializable and snapshot isolation guarantees provide a read-consistent view of your data. Serializable isolation is the stronger guarantee. For instance, say you have an employee table that maintains employee salaries, and you want to delete all records corresponding to employees with a salary greater than $100,000. Let’s say this salary table has five data files and three of those have records of employees with a salary greater than $100,000. When you initiate the delete operation, the three files containing employee salaries greater than $100,000 are selected. If your “delete_mode” is merge-on-read, a delete file is written that points to the positions to delete in those three data files. If your “delete_mode” is copy-on-write, then all three data files are simply rewritten.

Irrespective of the delete_mode, assume that while the delete operation is in progress, another user writes a new data file containing a salary greater than $100,000. If the isolation guarantee you chose is snapshot, then the delete operation will succeed and only the salary records corresponding to the original three data files are removed from your table. The records in the data file that was written while your delete operation was in progress will remain intact. On the other hand, if your isolation guarantee is serializable, then your delete operation will fail and you will have to retry the delete from scratch. Depending on your use case you might want to reduce your isolation level to “snapshot.”
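The salary example can be traced with a small toy model. This is not Iceberg’s implementation, only a sketch of the outcome described above: file names are invented, and a “matching” file simply means a file that contains at least one salary greater than $100,000. In a real table the level for deletes is chosen through the “write.delete.isolation-level” property.

```python
def commit_delete(files_at_start, files_at_commit, matching, isolation):
    """Toy model of committing a DELETE under the two isolation levels.

    files_at_start / files_at_commit: data files visible when the delete
    started and when it tries to commit.
    matching: files that contain rows satisfying the delete predicate.
    Returns the files whose matching rows are removed by the delete.
    """
    concurrent = files_at_commit - files_at_start
    if isolation == "serializable" and concurrent & matching:
        # A concurrently added file matches the predicate: fail and retry.
        raise RuntimeError("conflict detected, retry the delete")
    # Under snapshot isolation only the originally selected files are affected.
    return files_at_start & matching


start = {"f1", "f2", "f3", "f4", "f5"}
at_commit = start | {"f6"}              # f6 written by another user meanwhile
matching = {"f1", "f2", "f3", "f6"}     # files with salaries > $100,000

print(sorted(commit_delete(start, at_commit, matching, "snapshot")))
```

With “snapshot” the delete succeeds and f6 keeps its rows; calling the same function with “serializable” raises, which mirrors the retry described above.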

The problem

The presence of too many delete files will eventually reduce read performance, because in the Iceberg V2 spec, every time a data file is read, all the corresponding delete files also need to be read (the Iceberg community is currently considering introducing a concept called “delete vector” in the future, which might work differently from the current spec). This could be very costly. Position delete files might also contain dangling deletes, that is, references to data that is no longer present in any of the current snapshots.

Solution: rewrite position deletes

For position delete files, compacting them mitigates the problem a little by reducing the number of delete files that need to be read and offering faster performance through better compression of the delete data. In addition, the procedure also removes the dangling deletes.

Rewrite position delete files

Iceberg provides a rewrite position delete files procedure in Spark SQL.
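A sketch of how that procedure can be invoked, with the statement assembled in Python. The catalog and table names are placeholders, and the “rewrite-all” option is included only as an example of passing options; check which options your Iceberg version supports.

```python
def rewrite_position_deletes_sql(catalog: str, table: str, options=None) -> str:
    """Build a CALL statement for rewrite_position_delete_files."""
    args = [f"table => '{table}'"]
    if options:
        # Options are passed as a Spark SQL map('key', 'value', ...)
        pairs = ", ".join(f"'{k}', '{v}'" for k, v in options.items())
        args.append(f"options => map({pairs})")
    return (f"CALL {catalog}.system.rewrite_position_delete_files("
            f"{', '.join(args)})")


# Hypothetical names; run the result with spark.sql(...).
pos_stmt = rewrite_position_deletes_sql(
    "my_catalog", "db.sample", {"rewrite-all": "true"})
print(pos_stmt)
```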

But the presence of delete files still poses a performance problem. Also, regulatory requirements might force you to eventually physically delete the data rather than do a logical deletion. This can be addressed by doing a major compaction and removing the delete files entirely, which is addressed later in the blog.

Problem with small files

We typically want to minimize the number of files we touch during a read. Opening files is costly. File formats like Parquet work better if the underlying file size is large. Reading more of the same file is cheaper than opening a new file. In Parquet, typically you want your files to be around 512 MB and row-group sizes to be around 128 MB. During the write phase these are controlled by “write.target-file-size-bytes” and “write.parquet.row-group-size-bytes” respectively. You might want to leave the Iceberg defaults alone unless you know what you are doing.
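Both properties are expressed in bytes, so the sizes above have to be converted. The sketch below does the arithmetic and builds an ALTER TABLE statement; the table name is hypothetical, and the sizes are just the 512 MB and 128 MB figures mentioned above.

```python
MB = 1024 * 1024


def file_size_properties_sql(table: str, file_mb: int = 512,
                             row_group_mb: int = 128) -> str:
    """Build an ALTER TABLE statement for target file and row-group sizes."""
    props = {
        "write.target-file-size-bytes": file_mb * MB,
        "write.parquet.row-group-size-bytes": row_group_mb * MB,
    }
    body = ", ".join(f"'{k}' = '{v}'" for k, v in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({body})"


# Hypothetical table name; run the result with spark.sql(...).
size_stmt = file_size_properties_sql("my_catalog.db.sample")
print(size_stmt)
```

With these values a full-size file holds roughly four row groups (512 / 128).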

In Spark, for example, the size of a Spark task in memory needs to be much larger than these defaults, because when data is written to disk it is compressed in Parquet/ORC. So getting your files to the desired size is not easy unless your Spark task size is big enough.

Another problem arises with partitions. Unless aligned properly, a Spark task might touch multiple partitions. Let’s say you have 100 Spark tasks and each of them needs to write to 100 partitions; together they will write 10,000 small files. Let’s call this problem partition amplification.

Solution: use distribution-mode in write

The amplification problem could be addressed at write time by setting the appropriate write distribution mode in write properties. Insert distribution is controlled by “write.distribution-mode” and defaults to none. Delete distribution is controlled by “write.delete.distribution-mode” and defaults to hash. Update distribution is controlled by “write.update.distribution-mode” and defaults to hash. Merge distribution is controlled by “write.merge.distribution-mode” and defaults to none.

The three write distribution modes available in Iceberg as of this writing are none, hash, and range. When your mode is none, no data shuffle occurs. You should use this mode only when you don’t care about the partition amplification problem or when you know that each task in your job only writes to a specific partition.

When your mode is set to hash, your data is shuffled using the partition key to generate the hash code so that each resultant task will only write to a specific partition. When your distribution mode is range, your data is distributed such that it is ordered by the partition key, or by the sort key if the table has a SortOrder.
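To see why hash distribution removes partition amplification, here is a toy simulation. It is not how Spark or Iceberg shuffle data internally: the “none” mode is modeled as rows staying on whatever task they happen to be on (round-robin here), the “hash” mode routes each row by a CRC32 hash of its partition key, and every distinct (task, partition) pair is counted as one output file.

```python
import zlib


def count_files(partition_keys, num_tasks, mode):
    """Count output files as distinct (task, partition) pairs."""
    written = set()
    for i, key in enumerate(partition_keys):
        if mode == "none":
            task = i % num_tasks                       # no shuffle: rows stay put
        elif mode == "hash":
            task = zlib.crc32(key.encode()) % num_tasks  # route by partition key
        else:
            raise ValueError("mode must be 'none' or 'hash'")
        written.add((task, key))
    return len(written)


# 100 partitions with 100 rows each, processed by 100 tasks.
rows = [f"p{p}" for p in range(100) for _ in range(100)]
print(count_files(rows, 100, "none"))
print(count_files(rows, 100, "hash"))
```

In the unshuffled case every task ends up holding rows for every partition, which reproduces the 100 x 100 example from earlier. With hashing, all rows of a partition land on a single task, so the file count drops to the number of partitions.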

Using hash or range can get tricky, as you are now repartitioning the data based on the number of partitions your table might have. This can cause your Spark tasks after the shuffle to be either too small or too large. This problem can be mitigated by enabling adaptive query execution in Spark by setting “spark.sql.adaptive.enabled=true” (this is enabled by default from Spark 3.2). Several configs are available in Spark to adjust the behavior of adaptive query execution. Leaving the defaults as is, unless you know exactly what you are doing, is probably the best option.

Even though the partition amplification problem can be mitigated by setting the write distribution mode appropriate for your job, the resultant files could still be small simply because the Spark tasks writing them are small. A task can’t write more data than it has.

Solution: rewrite data files

To address the small files problem and the delete files problem, Iceberg provides a feature to rewrite data files. This feature is currently available only with Spark. The rest of the blog will go into this in more detail. This feature can be used to compact or even expand your data files, incorporate deletes from delete files corresponding to the data files that are being rewritten, provide better data ordering so that more data could be filtered directly at read time, and more. It is one of the most powerful tools that Iceberg provides.

RewriteDataFiles

Iceberg provides a rewrite data files procedure in Spark SQL.

See the RewriteDataFiles JavaDoc for all the supported options.

Now let’s discuss what the strategy option means, because it is important to understand it to get more out of the rewrite data files procedure. There are three strategy options available: Bin Pack, Sort, and Z Order. Note that when using the Spark procedure, the Z Order strategy is invoked by simply setting the sort_order to “zorder(columns…).”
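The three variants can be sketched as follows. The helper only assembles the Spark SQL text for the rewrite_data_files procedure; the catalog, table, and column names (my_catalog, db.sample, id, c1, c2) are made up for the example.

```python
def rewrite_data_files_sql(catalog: str, table: str,
                           strategy=None, sort_order=None) -> str:
    """Build a CALL statement for Iceberg's rewrite_data_files procedure."""
    args = [f"table => '{table}'"]
    if strategy:
        args.append(f"strategy => '{strategy}'")
    if sort_order:
        args.append(f"sort_order => '{sort_order}'")
    return f"CALL {catalog}.system.rewrite_data_files({', '.join(args)})"


# Hypothetical names; run each statement with spark.sql(...).
binpack_stmt = rewrite_data_files_sql("my_catalog", "db.sample",
                                      strategy="binpack")
sort_stmt = rewrite_data_files_sql("my_catalog", "db.sample",
                                   strategy="sort",
                                   sort_order="id DESC NULLS LAST")
# Z Order is requested through sort_order, as noted above.
zorder_stmt = rewrite_data_files_sql("my_catalog", "db.sample",
                                     strategy="sort",
                                     sort_order="zorder(c1,c2)")
for s in (binpack_stmt, sort_stmt, zorder_stmt):
    print(s)
```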

Strategy option

Bin Pack
- It is the cheapest and fastest.
- It combines files that are too small, using the bin packing approach to reduce the number of output files.
- No information ordering is modified.
- No information is shuffled.
Sort
- Much more expensive than Bin Pack.
- Provides total hierarchical ordering.
- Read queries only benefit if the columns used in the query are ordered.
- Requires data to be shuffled using range partitioning before writing.
Z Order
- Most expensive of the three options.
- The columns being used should have some kind of intrinsic clusterability, and each partition still needs a sufficient amount of data, because Z ordering only helps eliminate files from a read scan, not row groups. If both hold, queries can prune a lot of data at read time.
- It only makes sense if more than one column is used in the Z order. If only one column is needed, then regular sort is the better option.
- See https://blog.cloudera.com/speeding-up-queries-with-z-order/ to learn more about Z ordering.
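To make the Bin Pack idea from the list above more concrete, here is a toy first-fit packing of small files into groups that stay under a target size. It is only an illustration of bin packing in general; Iceberg’s actual planner differs in its details, and the file sizes below are invented.

```python
def bin_pack(file_sizes_mb, target_mb):
    """Group files first-fit so each group stays within target_mb.

    Files are taken in their existing order and are never sorted,
    mirroring the point that Bin Pack does not reorder data.
    """
    bins = []
    for size in file_sizes_mb:
        for group in bins:
            if sum(group) + size <= target_mb:
                group.append(size)
                break
        else:
            # No existing group has room: start a new output file.
            bins.append([size])
    return bins


groups = bin_pack([100, 200, 300, 50, 400, 150], target_mb=512)
print(groups)
```

In this example six input files would be combined into three output files.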

Commit conflicts

Iceberg uses optimistic concurrency control when committing new snapshots. So, when we use rewrite data files to update our data, a new snapshot is created. But before that snapshot is committed, a check is done to see if there are any conflicts. If a conflict occurs, all the work done could potentially be discarded. It is important to plan maintenance operations to minimize potential conflicts. Let us discuss some of the sources of conflicts.

- If only inserts occurred between the start of the rewrite and the commit attempt, then there are no conflicts. This is because inserts result in new data files, and the new data files can be added to the snapshot for the rewrite and the commit reattempted.
- Every delete file is associated with one or more data files. If a new delete file corresponding to a data file that is being rewritten is added in a future snapshot, then a conflict occurs because the delete file references a data file that is already being rewritten.
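The two rules above can be expressed as a small check. This is a toy model rather than Iceberg’s validation code: each concurrent commit is represented by a hypothetical dictionary listing the data files it added and the data files its new delete files point at.

```python
def rewrite_conflicts(rewritten_files, concurrent_commits):
    """Return True if any concurrent commit conflicts with the rewrite.

    rewritten_files: set of data files the rewrite is replacing.
    concurrent_commits: list of dicts with optional keys
      "added_data_files" and "delete_targets" (both sets of file names).
    """
    for commit in concurrent_commits:
        # Newly inserted data files never conflict with a rewrite.
        targets = commit.get("delete_targets", set())
        if targets & rewritten_files:
            # A new delete file references a file that is being rewritten.
            return True
    return False


rewriting = {"f1", "f2"}
inserts_only = [{"added_data_files": {"f9"}}]
delete_on_rewritten = [{"delete_targets": {"f2"}}]
delete_elsewhere = [{"delete_targets": {"f7"}}]

print(rewrite_conflicts(rewriting, inserts_only))
print(rewrite_conflicts(rewriting, delete_on_rewritten))
print(rewrite_conflicts(rewriting, delete_elsewhere))
```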

Conflict mitigation

- If you can, try pausing jobs that write to your tables during the maintenance operations. Or at least make sure deletes are not written to files that are being rewritten.
- Partition your table in such a way that all new writes and deletes are written to a new partition. For instance, if your incoming data is partitioned by date, all your new data can go into a partition by date. You can run rewrite operations on partitions with older dates.
- Take advantage of the filter option in the rewrite data files Spark action to best select the files to be rewritten based on your use case so that no delete conflicts occur.
- Enabling partial progress will help save your work by committing groups of files prior to the entire rewrite completing. Even if one of the file groups fails, other file groups can succeed.
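The last three points can be combined in one call, sketched below: a filter restricts the rewrite to older partitions, and partial progress lets file groups commit independently. The table db.events, the column event_date, the cutoff date, and the max-commits value are assumptions for the example; the statement text is only built here, not executed.

```python
def rewrite_filtered_sql(catalog: str, table: str,
                         where: str, options: dict) -> str:
    """Build a rewrite_data_files CALL with a filter and options map."""
    pairs = ", ".join(f"'{k}', '{v}'" for k, v in options.items())
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"where => '{where}', "
        f"options => map({pairs}))"
    )


# Hypothetical table and column; run the result with spark.sql(...).
filtered_stmt = rewrite_filtered_sql(
    "my_catalog", "db.events",
    where='event_date < "2024-02-01"',      # only touch older partitions
    options={
        "partial-progress.enabled": "true",  # commit file groups as they finish
        "partial-progress.max-commits": "10",
    },
)
print(filtered_stmt)
```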

Conclusion

Iceberg provides several features that a modern data lake needs. With a little care, planning, and an understanding of Iceberg’s architecture, one can take maximum advantage of all the features it provides.

To try some of these Iceberg features yourself you can sign up for one of our next live hands-on labs.

You can also watch the webinar to learn more about Apache Iceberg and see the demo to learn about the latest capabilities.