Azure Databricks. The Blog of 60 questions. Part 5
Co-written by Terry McCann & Simon Whiteley.
A few weeks ago we delivered a condensed version of our Azure Databricks course to a sold-out crowd at the UK's largest data platform conference, SQLBits. The course was a condensed version of our 3-day Applied Azure Databricks programme. During the course we were asked a lot of incredible questions. This blog collects all of those questions along with a set of detailed answers. If you are looking to accelerate your journey to Databricks, then take a look at our Databricks services.
There were over 60 questions. Some are a little duplicated, and some require a lot more detail than others. 60 is too many to tackle in one blog, so this is one of a series of 6 blogs going into detail on the questions. They are posted in the order they were asked. I have altered the questions to give them more context. Thank you to all those who asked questions.
Part one. Questions 1 to 10
Part two. Questions 11 to 20
Part three. Questions 21 to 30
Part four. Questions 31 to 40
Part five. Questions 41 to 50
Part six. Questions 51 to 63
Q41: Will .option("mode", "PERMISSIVE") work for Scala workloads?
A: Yes. The reader modes (PERMISSIVE, DROPMALFORMED and FAILFAST) are options on the DataFrame reader itself, so they behave the same way in Scala as they do in Python.
https://docs.databricks.com/spark/latest/data-sources/read-csv.html#verify-correctness-of-the-data
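As a minimal sketch in Scala, using the spark session that Databricks notebooks provide (the file path and schema here are hypothetical):

```scala
import org.apache.spark.sql.types._

// Hypothetical schema; the _corrupt_record column captures rows that fail to parse
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("amount", DoubleType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE") // malformed rows yield nulls instead of failing the read
  .schema(schema)
  .csv("/mnt/data/sales.csv") // hypothetical path
```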
Q42: Is there a standard design / dev pattern on how to work with metadata (extracting, updating, reusing)?
A: I would love to say yes and here it is, but no, there is no accepted framework. It is worth checking ISO for the various metadata standards relevant to your industry.
Q43: Does a JSON file with an array on the root level qualify as the 'massive data file' which cannot be handled multithreaded?
A: This relates to how we can process and chunk up a large JSON file. If your file holds one document per line, delimited with a carriage return and line feed, then Spark can split it and process it in parallel. If it is one huge document, such as a single array at the root level, it cannot be split that way and will be read by a single task. It is worth a try, to see what you can do to optimise it.
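As a minimal sketch of the difference (the paths here are hypothetical):

```scala
// One JSON document per line (JSON Lines): Spark can split the file across tasks
val lineDelimited = spark.read.json("/mnt/data/events.jsonl")

// One huge document, e.g. a single array at the root: multiLine mode is needed,
// and the whole file is read by a single task rather than in parallel
val singleDocument = spark.read
  .option("multiLine", "true")
  .json("/mnt/data/big_array.json")
```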
Q44: is there a magic figure of rows for parquet to compress into a compressed row group ( like in SQL server it is 1M)
A: This question compares the row-group behaviour of Parquet to columnstore compression in SQL Server. I am not aware of any row-count limit; Parquet sizes its row groups in bytes rather than rows.
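The default target row-group size in Parquet is 128 MB, and you can tune it through the Hadoop configuration. A sketch, assuming a DataFrame df and a hypothetical output path:

```scala
// parquet.block.size sets the target row-group size in bytes (default 128 MB);
// there is no fixed row-count equivalent to SQL Server's 1M-row rowgroup
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 128 * 1024 * 1024)

df.write.parquet("/mnt/data/output") // hypothetical output path
```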
Q45: Is it reasonable to be swapping back and forth many times between languages in a notebook script?
A: Covered in another question. Yes, but try not to; each switch makes the notebook harder to read and maintain.
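When you do need to cross languages, the usual pattern is to hand data over through a temporary view rather than switching cell by cell. A sketch, assuming a DataFrame df already exists in a Scala cell (the view name is hypothetical):

```scala
// Register the DataFrame so other languages in the notebook can see it
df.createOrReplaceTempView("staged_data")

// A later cell in another language can then pick it up, e.g.:
// %sql
// SELECT COUNT(*) FROM staged_data
```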
Q46: All of these languages, why not C#? And don't say it is not academic or that it is slow...
A: This is going to be supported. Microsoft is working on .NET for Apache Spark: https://github.com/dotnet/spark
Q47: Is there a risk that Databricks allows cost savings by reducing expensive/niche data science resource that spend 70/80% of their time data wrangling, only to replace it with expensive/niche data engineers that need to be proficient/efficient in several languages to be able to maintain the Databricks/pipeline estate?
A: This is a valid argument, but shifting that wrangling time from a scarce, expensive resource to a more automated process allows the data science team to work more effectively and creates a better return on investment.
Q48: Not sure if I missed something. SQL Data Warehouse is Azure only right?
A: Sort of. Yes, it is Azure-only, but it is based on PDW (Parallel Data Warehouse), which is on-premises.
Q49: Azure DWH vs. Databricks - when would you choose Databricks over Azure DWH (PolyBase / in-memory / language support etc.)?
A: Too big a question to answer here. When I have more time I will come back to this with a full answer. For now:
Languages? If the team only knows SQL, then use ASDW.
Machine learning? Databricks.
Streaming? Databricks.
Finer-grained cost management? Databricks.
Q50: Don't ADF Data Flows go against the whole Extract-Load-Transform (ELT) pattern that everything else in the MS Azure ecosystem is built around?
A: No. Although you're building what looks like an SSIS data flow, it compiles down to Spark code and runs as a distributed job where the data lives. This is still ELT.
Data Platform Microsoft MVP & Voice of Data Science in Production. You can follow Terry on Twitter @SQLShark, where he frequently discusses Data Science in Production.