In this article, I’ll share five simple yet powerful tips to boost your Databricks performance. These strategies can transform your workflows and save you time and money.
- Use larger clusters: Often overlooked, larger clusters can be more cost-effective than smaller ones. They complete tasks faster, saving time and money.
- Leverage Photon: Databricks’ high-speed execution engine is a game changer.
- Clean up your configurations: Outdated Spark configurations can slow you down.
- Utilize Delta Caching: Effective caching can significantly speed up your processes.
- Understand lazy evaluation: This concept is crucial for Spark coding.
1. Supercharge Your Clusters
One common mistake is using clusters that are too small, often due to cost concerns. Small clusters with minimal cores lead to slow performance. However, larger clusters can be more cost-effective because they complete tasks faster.
For example, a two-worker cluster might take an hour to finish a job, so you pay for two workers for a full hour. A four-worker cluster costs twice as much per hour, but if it finishes the same job in half the time, the total spend is roughly the same and you get your results twice as fast. This holds as long as there's enough parallel work to keep the extra workers busy.
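To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The hourly rate and runtimes are made-up illustrative numbers, not real Databricks or cloud pricing.
# Illustrative only: the rate and runtimes below are made-up numbers,
# not real Databricks or cloud pricing.
rate_per_worker_hour = 2.0  # hypothetical cost of one worker for one hour

# Two workers finishing in 1 hour vs. four workers finishing in 0.5 hours
small_cluster_cost = 2 * rate_per_worker_hour * 1.0
large_cluster_cost = 4 * rate_per_worker_hour * 0.5

print(small_cluster_cost)  # 4.0
print(large_cluster_cost)  # 4.0 -- same spend, results in half the time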
2. Embrace Photon
Photon, Databricks’ vectorized execution engine, is written in C++ and offers impressive speed improvements. It takes advantage of CPU-level features such as SIMD instructions and manages memory more efficiently, significantly accelerating operations like joins, aggregations, and ETL workloads.
Photon excels at built-in functions and operators, as well as writes to Parquet or Delta formats. It won’t speed up custom user-defined functions (UDFs), though, or jobs that spend most of their time pulling data from slow external databases. Despite these limitations, the boost it provides for native operations is substantial.
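One practical way to see how much of your query Photon actually covers is to look at the physical plan. On a Photon-enabled cluster, operators Photon handles typically show up with Photon-prefixed names in the explain output, while unsupported pieces (such as Python UDFs) fall back to regular Spark operators. A rough sketch, assuming a DataFrame named df with a name and salary column already exists:
# Inspect the physical plan; on a Photon-enabled cluster, the stages Photon can
# handle typically appear as Photon* operators, while anything it cannot handle
# (e.g. Python UDFs) falls back to standard Spark operators.
agg_df = df.groupBy("name").sum("salary")  # built-in aggregation, Photon-friendly
agg_df.explain(mode="formatted")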
3. Purge Old Configurations
Those old Spark configurations you’ve carried over from one version to the next could be hindering performance. Cleaning out outdated settings can take jobs from hours to minutes. Settings that once worked around a quirk in an older version may now be causing problems of their own. Revisit what you have explicitly set and, where you can’t justify an override, reset it to the default — this often leads to better performance.
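A quick way to audit what you are actually carrying around is to list the Spark properties that have been explicitly set on the session or cluster and compare them against what you still need. A minimal sketch, assuming the usual Databricks-provided spark session:
# List Spark properties that have been explicitly set (by the cluster config,
# init scripts, or spark.conf.set calls) so stale overrides stand out.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    if key.startswith("spark."):
        print(f"{key} = {value}")

# To test a suspect override, remove it from the cluster or job configuration
# (or stop setting it in code) and re-run the job, rather than piling on
# further overrides to compensate.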
4. Leverage Delta Caching
Delta Cache loads data from cloud storage into the workers’ SSDs for faster access. This can be a huge performance booster, especially for BI tasks that repeatedly read the same tables. If you’re using Databricks SQL Endpoints, caching is on by default, and you can use commands like CACHE SELECT * FROM table to preload frequently accessed tables.
For regular clusters, using instances with fast SSDs (such as the i3 series on AWS, L or E series on Azure, or n2 on GCP) will enable caching by default. While caching can dramatically speed up tasks that repeatedly read the same data, its benefits might be limited for one-time read operations typical of some ETL jobs. Assess your job patterns to maximize the benefits of Delta Caching.
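As a sketch of the preloading pattern mentioned above, you can issue the CACHE SELECT command from a notebook as well; the table name and filter here are hypothetical placeholders for your own data.
# Preload the columns your dashboards keep hitting into the workers' SSD cache.
# "sales" and the region filter are hypothetical placeholders.
spark.sql("CACHE SELECT customer_id, region, amount FROM sales WHERE region = 'EMEA'")

# Later reads of that data are served from the local SSDs instead of cloud storage.
spark.table("sales").filter("region = 'EMEA'").groupBy("customer_id").count().show()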
5. Be Mindful of Lazy Evaluation
Lazy evaluation means Spark builds an execution plan and delays actual computation until an action like displaying or writing results is performed. This is great for optimizing performance but can lead to repeated computations if not handled carefully.
For example, if you display or write results multiple times, Spark re-runs the whole execution plan each time. To avoid this, cache or write out intermediate results that you’ll reuse, as in the example below. This minimizes unnecessary re-computation while keeping the optimization benefits of lazy evaluation.
from pyspark.sql.functions import col

# Create the dataframe
data = [
(1, "Alice", 70000),
(2, "Bob", 85000),
(3, "Catherine", 95000),
(4, "David", 60000),
(5, "Eva", 105000)
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
# Lazy transformations (no actual computation happens here)
df_filtered = df.filter(col("salary") > 2000)
df_selected = df_filtered.select("name", "salary")
# Action (computation happens here)
df_selected.show()
# If we perform another action without caching, the computation will be repeated
df_selected.show()
# To avoid this re-computation we can either cache the result....
df_selected.cache()   # cache() is lazy too; the data is materialized by the next action
df_selected.count()   # action that populates the cache
# .... or write it to a file
output_path = "/tmp/spark_output"
df_selected.write.mode("overwrite").parquet(output_path)
df_read_back = spark.read.parquet(output_path)
df_read_back.show()
By applying these tips, you can significantly enhance the performance of your Databricks jobs. Keep experimenting and optimizing—your future self will thank you!