Scheduling Databricks Cluster Uptime

Problem

Interactive clusters and SQL Warehouses (formerly known as SQL Endpoints) take time to become active, anywhere from around 5 minutes to almost 10 minutes. For some workloads and users, this wait is frustrating, if not unacceptable.
In this use case, we had streaming clusters that needed to be available when streams started at 07:00 and turned off when streams stopped being sent at 21:00. Similarly, business users needed their SQL Warehouse clusters to be available when the business started trading, so that their BI reports didn't time out waiting for the clusters to start.

Solution

A simple solution to both problems doesn't yet exist. It may be possible to use SQL Serverless for the Warehouse clusters; however, at the time of writing, Serverless compute is still in private preview, and an equivalent for Engineering workloads doesn't really exist.

Therefore, we need a method for warming up the clusters before they get used, and this is where the API comes in handy. The Databricks API is incredibly powerful and allows you to programmatically control your Databricks experience.
The process for programmatically scheduling the warm-up of Engineering and Warehouse clusters is the same, but the API endpoints are different, so the code is different (notebooks below).

Process

Cluster Scheduling Process: Pass in Params, Generate Bearer Token, Start / Stop Cluster(s)
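
The first two steps of this process can be sketched in Python as follows. This is a minimal sketch, not the original notebook code: the parameter names (`workspace_url`, `action`, `target_ids`) and the helper names are hypothetical, and the bearer token is taken as an input because the way it is generated is not shown here. In a notebook, the raw values would typically come from `dbutils.widgets.get`.

```python
from dataclasses import dataclass

VALID_ACTIONS = {"start", "stop"}


@dataclass(frozen=True)
class ScheduleParams:
    """Validated inputs for one scheduled run (hypothetical structure)."""
    workspace_url: str
    action: str
    target_ids: tuple


def parse_params(raw: dict) -> ScheduleParams:
    """Step 1: normalise and validate the parameters passed to the notebook."""
    action = raw.get("action", "").strip().lower()
    if action not in VALID_ACTIONS:
        raise ValueError(f"action must be 'start' or 'stop', got {action!r}")

    # Accept a comma-separated list so one run can handle several clusters.
    ids = tuple(
        part.strip() for part in raw.get("target_ids", "").split(",") if part.strip()
    )
    if not ids:
        raise ValueError("at least one cluster or warehouse ID is required")

    url = raw.get("workspace_url", "").strip().rstrip("/")
    if not url.startswith("https://"):
        raise ValueError("workspace_url must start with https://")

    return ScheduleParams(workspace_url=url, action=action, target_ids=ids)


def auth_headers(token: str) -> dict:
    """Step 2 (second half): wrap an already-generated token in request headers."""
    if not token:
        raise ValueError("a bearer token is required")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Validating the action up front means a mistyped parameter fails the job immediately, rather than sending an unexpected request to the workspace.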

Engineering Cluster Notebook
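
As a rough illustration of the start / stop step for an Engineering cluster, here is a minimal sketch assuming the Clusters API 2.0 endpoints `clusters/start` and `clusters/delete` (the latter terminates a cluster rather than permanently removing it). The function names are hypothetical, and the workspace URL, token and cluster ID are supplied by the caller.

```python
import json
import urllib.request

# Clusters API 2.0 paths. "delete" terminates the cluster; it can be started
# again later (permanent removal is a separate "permanent-delete" call).
CLUSTER_ENDPOINTS = {
    "start": "/api/2.0/clusters/start",
    "stop": "/api/2.0/clusters/delete",
}


def build_cluster_request(workspace_url, token, cluster_id, action):
    """Build (but do not send) the POST request for one cluster."""
    if action not in CLUSTER_ENDPOINTS:
        raise ValueError(f"action must be 'start' or 'stop', got {action!r}")
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + CLUSTER_ENDPOINTS[action],
        data=json.dumps({"cluster_id": cluster_id}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def change_cluster_state(workspace_url, token, cluster_id, action):
    """Send the request and return the HTTP status code.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = build_cluster_request(workspace_url, token, cluster_id, action)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status
```

Keeping the request construction separate from the network call makes it easy to check the URL and body without touching a real workspace.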

SQL Warehouse Cluster Notebook
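
For SQL Warehouses the ID goes in the URL path rather than in a JSON body. The following is a minimal sketch assuming the SQL Warehouses API paths `/api/2.0/sql/warehouses/{id}/start` and `/api/2.0/sql/warehouses/{id}/stop`; the function names are hypothetical, and the workspace URL, token and warehouse IDs are supplied by the caller.

```python
import urllib.parse
import urllib.request


def warehouse_url(workspace_url, warehouse_id, action):
    """Return the start/stop URL for a single SQL Warehouse."""
    if action not in ("start", "stop"):
        raise ValueError(f"action must be 'start' or 'stop', got {action!r}")
    # Percent-encode the ID so it cannot alter the path structure.
    safe_id = urllib.parse.quote(warehouse_id, safe="")
    return f"{workspace_url.rstrip('/')}/api/2.0/sql/warehouses/{safe_id}/{action}"


def change_warehouse_states(workspace_url, token, warehouse_ids, action):
    """POST to each warehouse in turn and return {warehouse_id: HTTP status}.

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, which stops
    the loop at the first failing warehouse.
    """
    results = {}
    for warehouse_id in warehouse_ids:
        request = urllib.request.Request(
            url=warehouse_url(workspace_url, warehouse_id, action),
            headers={"Authorization": f"Bearer {token}"},
            method="POST",  # no request body is sent for these endpoints
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            results[warehouse_id] = response.status
    return results
```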

Conclusion

With our new notebooks, we can now schedule them as jobs running on a job cluster, so that they call the API to start or stop clusters at specific times of day.
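
One way to put a notebook on such a schedule is a job definition with a Quartz cron expression. The sketch below builds a settings payload in the shape of the Jobs API 2.1 `jobs/create` request. The helper names, the `action` notebook parameter and the task key are illustrative assumptions, and the job cluster spec is left to the caller because its values are workspace-specific.

```python
def daily_quartz_cron(hour, minute=0):
    """Return a Quartz cron expression that fires once a day at hour:minute."""
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Quartz fields: seconds minutes hours day-of-month month day-of-week
    return f"0 {minute} {hour} * * ?"


def build_job_settings(name, notebook_path, action, hour, minute,
                       timezone_id, job_cluster_spec):
    """Build a job settings dict for one scheduled start or stop run."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": f"{action}_clusters",
                "notebook_task": {
                    "notebook_path": notebook_path,
                    # Passed to the notebook as a parameter (hypothetical name).
                    "base_parameters": {"action": action},
                },
                # Job cluster definition supplied by the caller.
                "new_cluster": job_cluster_spec,
            }
        ],
        "schedule": {
            "quartz_cron_expression": daily_quartz_cron(hour, minute),
            "timezone_id": timezone_id,
            "pause_status": "UNPAUSED",
        },
    }
```

For the streaming example, a start job could be scheduled a little ahead of 07:00 (say 06:45, to cover the start-up time) and a stop job at 21:00, each pointing at the same notebook with a different `action` value.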

For Engineering clusters with a streaming workload, this allows the clusters to be available when the streams start, as well as allowing the VMs some downtime for maintenance once streams stop.

For SQL Warehouse clusters, this means the clusters are already running by the time users start work, ready for them to refresh their reports and query data, so no more waiting for clusters to start.

There are so many more uses for the Databricks API and we've just touched on a few here. If you're using the API for a different use case, pop it in the comments below as I'm interested to know how other people use the API.

Originally published on UstDoes.tech

Ust Oldfield