Working with Data Schemas and Models in Scala: Case Classes

This is the third of my blogs in the Scala Parlour Series, in which we explore Scala and why it is great for Data Engineering. If you haven’t already, please check out the others in the series, starting here, where you can read all about the core concepts of Scala, including who uses it and why.

Ok, so you’ve got a load of data you want to work with between ETL steps. For example, thinking back to our previous blog on Monads: you’re retrieving some data from a 3rd party API and it’s in XML, but you want to store it in your Data Lake in JSON format. How do you go about this with Scala? How can you map your XML to a *Type Safe model?

*Type safety means that once the Type of a property is set, it cannot change. So, something declared as a String is always a String and can never quietly become an Int, and the compiler checks this before your code ever runs. Or in simpler terms, you cannot turn a Cat into a Dog.
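
To make that concrete, here is a minimal sketch. The object and value names are made up for illustration and are not from the series. It shows that moving from a String to an Int is always an explicit step that produces a new value, never a change to the original:

```scala
import scala.util.Try

object TypeSafetySketch {
  // Once declared as a String, this value is always a String.
  val ageAsText: String = "7"

  // The next line would be rejected by the compiler (type mismatch),
  // so it is left commented out:
  // val age: Int = ageAsText

  // Converting is an explicit step that produces a new value.
  // Try captures the failure when the text is not a number.
  val age: Option[Int] = Try(ageAsText.toInt).toOption
  val notANumber: Option[Int] = Try("seven".toInt).toOption

  def main(args: Array[String]): Unit = {
    println(age)        // Some(7)
    println(notANumber) // None
  }
}
```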

[Image: scala_catdog.png]

Ok, so how do you model data in Scala? Case Classes! 

A Scala Case Class acts like a blueprint for data. You define a Case Class with all the fields you expect your data to have, effectively predefining your schema. For example, the code below defines an “Employee” Case Class with the fields FirstName, LastName and Age:

case class Employee(FirstName:String, LastName:String, Age:Int)  
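
As a quick sketch of what that one line gives you, the example below wraps the same Employee definition in a hypothetical object (EmployeeModelSketch is my own name) so it can be run on its own. It reads a field and prints the instance using the toString that Case Classes generate:

```scala
object EmployeeModelSketch {
  // The schema: three named, typed fields.
  case class Employee(FirstName: String, LastName: String, Age: Int)

  val emp: Employee = Employee("Ada", "Lovelace", 7)

  def main(args: Array[String]): Unit = {
    println(emp.FirstName) // Ada
    println(emp)           // Employee(Ada,Lovelace,7)

    // Neither of these would compile, so they are left commented out:
    // Employee("Ada", "Lovelace", "seven") // Age must be an Int
    // emp.Age = 8                          // fields are immutable
  }
}
```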

Ok, but you could do this in your usual language of choice, right? What makes a Case Class so special? Well, you get a ton of stuff out of the box with Case Classes that helps your data journeys. This includes:

  • Automatically-generated equals and hashCode methods, so instances can be compared 

  • Automatically-generated copy method, which is extremely helpful when you need to clone an object and update one or more of its fields while doing so

  • When you define a Class as a Case Class, you don’t have to use the new keyword to create a new instance 

Let’s explore! 

Automatically-generated equals and hashCode  

So, in languages such as C# or Python, comparing 2 “Employee” instances of an ordinary class will, by default, do exactly that: compare the objects themselves but NOT the data inside. However, we don’t care whether it’s the same “object” or not; we want to know whether the 2 Employee objects hold the same data. Well, Case Classes do this out of the box:

case class Employee(FirstName:String, LastName:String, Age:Int) 
 
val emp1 = Employee("Ada", "Lovelace", 7) 
val emp2 = Employee("Ada", "Lovelace", 7) 
 
println(emp1 == emp2) 

So, in the above we are comparing two instances of Employee, emp1 and emp2. The comparison returns a Boolean of true because, you guessed it, they hold the same data.
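
To see the difference the word “case” makes, here is a small sketch that puts an ordinary class next to the Case Class. PlainEmployee and the object name are hypothetical, added only for the comparison:

```scala
object EqualitySketch {
  // An ordinary class keeps the default reference equality.
  class PlainEmployee(val FirstName: String, val LastName: String, val Age: Int)

  // A case class gets equals and hashCode generated from its fields.
  case class Employee(FirstName: String, LastName: String, Age: Int)

  val plain1 = new PlainEmployee("Ada", "Lovelace", 7)
  val plain2 = new PlainEmployee("Ada", "Lovelace", 7)
  val emp1 = Employee("Ada", "Lovelace", 7)
  val emp2 = Employee("Ada", "Lovelace", 7)

  val plainSame: Boolean = plain1 == plain2 // two different objects
  val caseSame: Boolean = emp1 == emp2      // same data

  def main(args: Array[String]): Unit = {
    println(plainSame) // false
    println(caseSame)  // true
  }
}
```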

What else can we do? 

Lots of things! In particular, we can use Scala Pattern Matching with Case Classes, a very popular tool throughout Scala:

case class Employee(FirstName:String, LastName:String, Age:Int) 
 
val emp1 = Employee("Ada", "Lovelace", 7) 
 
val result = emp1 match { 
  case Employee("Ada", "Lovelace", 7) => "Employee match" 
  case other => other.FirstName // anything else: fall back to the first name
} 
 
print(result) 

In the above we are matching on the Employee instance emp1. Because emp1 matches the first pattern, the result is “Employee match”; any other Employee would fall through to the second case and return its FirstName instead. This is a simple example, but it could easily be expanded to do more complex comparisons and extractions with our data.
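
As a sketch of what “more complex” might look like, the example below pulls fields out of the instance and adds a guard condition. The describe function, the under-18 rule and the third employee are all made up for illustration:

```scala
object PatternMatchSketch {
  case class Employee(FirstName: String, LastName: String, Age: Int)

  // Patterns are tried top to bottom; the first one that fits wins.
  def describe(emp: Employee): String = emp match {
    // Match on exact values, ignoring Age with the _ wildcard.
    case Employee("Ada", "Lovelace", _) => "Found Ada Lovelace"
    // Extract fields into names and add a guard condition.
    case Employee(first, _, age) if age < 18 => s"$first is under 18"
    // Catch-all: extract both names.
    case Employee(first, last, _) => s"$first $last"
  }

  def main(args: Array[String]): Unit = {
    println(describe(Employee("Ada", "Lovelace", 7)))  // Found Ada Lovelace
    println(describe(Employee("Anna", "Wykes", 8)))    // Anna is under 18
    println(describe(Employee("Grace", "Hopper", 45))) // Grace Hopper
  }
}
```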

So that’s the “equals” but what’s “hashing”?  

Some of you may have already worked with hashing when comparing data, but for those who haven’t: hashing lets you boil your data down to a single compact “hash” code. Why would you want to do this? So that you can look for changes in your data without comparing every single field one by one: if two hash codes differ, the data differs. (Hash codes are not guaranteed to be unique, so two different records can occasionally share one.) This is particularly common when looking for changes in Fact and Dimension data (Star Schemas), most frequently found in our old friend the Data Warehouse, in Data Lakes or, more recently, in Data Lakehouses.

Observe: 

case class Employee(FirstName:String, LastName:String, Age:Int) 
 
val emp1 = Employee("Ada", "Lovelace", 7) 
val emp2 = Employee("Anna", "Wykes", 8) 
 
println(emp1.hashCode()) 
println(emp2.hashCode()) 
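
Each call prints an Int. Instances holding the same data always produce the same hash code, while instances holding different data will almost always produce different ones. Here is a minimal sketch of using that for change detection. The numeric keys and the two Maps are hypothetical stand-ins for a previous load and an incoming batch:

```scala
object HashChangeSketch {
  case class Employee(FirstName: String, LastName: String, Age: Int)

  // Hypothetical snapshot from a previous load: key -> hash of the row.
  val previousHashes: Map[Int, Int] = Map(
    1 -> Employee("Ada", "Lovelace", 7).hashCode(),
    2 -> Employee("Anna", "Wykes", 8).hashCode()
  )

  // Hypothetical incoming rows: employee 2 has had a birthday.
  val incoming: Map[Int, Employee] = Map(
    1 -> Employee("Ada", "Lovelace", 7),
    2 -> Employee("Anna", "Wykes", 9)
  )

  // Keys that are new, or whose hash differs from the snapshot.
  val changedKeys: Set[Int] = incoming.collect {
    case (id, emp) if !previousHashes.get(id).contains(emp.hashCode()) => id
  }.toSet

  def main(args: Array[String]): Unit = {
    println(changedKeys) // Set(2)
  }
}
```

Because hashCode is only a 32-bit Int, matching hash codes strongly suggest, but do not prove, that two records are identical.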

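Automatically-generated copy method

The copy method clones an instance while changing one or more of its fields. Printing emp1 alongside a copy created with emp1.copy(FirstName = "Dave") gives the following results: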

Employee(Ada,Lovelace,7) 
Employee(Dave,Lovelace,7) 

The copy method takes a copy of our emp1 Employee instance and changes the FirstName to “Dave”, creating a completely new instance with the change applied. The original emp1 is left untouched.
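
A minimal, runnable sketch that reproduces the results above, assuming the copy is made by passing FirstName as a named argument (the wrapping object is my own addition):

```scala
object CopySketch {
  case class Employee(FirstName: String, LastName: String, Age: Int)

  val emp1: Employee = Employee("Ada", "Lovelace", 7)

  // copy clones emp1, overriding only the field we name.
  val emp2: Employee = emp1.copy(FirstName = "Dave")

  def main(args: Array[String]): Unit = {
    println(emp1) // Employee(Ada,Lovelace,7)
    println(emp2) // Employee(Dave,Lovelace,7)
  }
}
```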

You don’t have to use the “new” keyword every time 

So, any C# or Java developers out there will be used to creating instances of classes with the new keyword, but this is not needed for Case Classes. What do I mean? Observe:

val emp1 = Employee("Ada", "Lovelace", 7) 
val emp2 = new Employee("Ada", "Lovelace", 7) 

The 2 lines of code above do exactly the same thing. For Case Classes the “new” is simply optional, whereas in a lot of languages it would be required.
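
The reason this works is that the compiler generates a companion object with an apply method for every Case Class, so Employee(...) is shorthand for Employee.apply(...). A small sketch (the object and value names are mine) showing that all three forms build equal instances:

```scala
object ApplySketch {
  case class Employee(FirstName: String, LastName: String, Age: Int)

  val viaShorthand: Employee = Employee("Ada", "Lovelace", 7)
  val viaApply: Employee = Employee.apply("Ada", "Lovelace", 7) // what the shorthand expands to
  val viaNew: Employee = new Employee("Ada", "Lovelace", 7)

  def main(args: Array[String]): Unit = {
    println(viaShorthand == viaApply && viaApply == viaNew) // true
  }
}
```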

Summary 

So, Case Classes are super useful for data modelling. In short, they let you define a data model in a single line of code, and with it you get Type Safety, Pattern Matching, Hashing and much more.

A hands-on example of Case Classes being utilized, including serialization/deserialization can be found in my previous blog here on Monads, in which you can observe the Employee and Error Case Classes. 

 

Thank you for reading, and let me know if you have any questions around Case Classes & Scala!