Scala For Big Data Engineering – Monads

This is the second of my blogs in the Scala Parlour Series, in which we explore Scala and why it is great for Data Engineering. If you haven’t already, please check out the first in the series, which covers the core concepts of Scala, including who uses it and why.

In this article we will explore monads within the Functional Programming (FP) paradigm, and how they can be used in Scala to aid Data Engineering.  

Monads 

You can’t talk about FP without monads. For those who have not encountered FP this can be a confusing concept, and the many convoluted explanations of what a monad actually is do not help. Monads are rooted in mathematics and computer science, so it can be tricky to find a clear, simple explanation without a lot of detailed theory upfront. For example, Wikipedia defines it as: “A monad is a design pattern that allows structuring programs generically while automating away boilerplate code needed by the program logic” - not exactly clear what it actually is!

So what is a monad really and how can it help you within your Data Engineering projects?    

Convoluted explanations aside, in short, a monad is a sequence of events with a “get out of jail card”. 

monads2.png

Still don’t get what this means? How can you apply this in Scala?

Let’s look at some code, using a “for comprehension”. This is a code construct in Scala that you will see used frequently. Strictly speaking it is not a monad in itself; it is syntactic sugar for chaining calls to flatMap and map on monadic types such as Either, Option and Try. People getting started with Scala can easily be confused by it, as it looks very similar to the “for each” loops that exist in most programming languages (including Scala).
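To make the “syntactic sugar” point concrete, here is a minimal sketch using only the standard library. The parseInt helper and the Desugar object are hypothetical names invented for this illustration, not part of the parser shown later. The two methods should behave identically, because the compiler rewrites the first into roughly the second:

```scala
import scala.util.Try

object Desugar {
  // Hypothetical helper: parse a string as an Int, or describe the failure.
  def parseInt(s: String): Either[String, Int] =
    Try(s.toInt).toOption.toRight(s"'$s' is not a number")

  // The for comprehension form.
  def sumFor(a: String, b: String): Either[String, Int] =
    for {
      x <- parseInt(a)
      y <- parseInt(b)
    } yield x + y

  // Roughly what the compiler turns it into: nested flatMap calls, with a final map.
  def sumFlatMap(a: String, b: String): Either[String, Int] =
    parseInt(a).flatMap(x => parseInt(b).map(y => x + y))
}
```

So a for comprehension is not a loop at all: every `<-` becomes a flatMap (or a map for the last one), which is why it only works on types that provide those methods.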

So, let’s start by observing the below Scala code - don’t worry, you don’t need to understand everything!


import scala.util.Try
import scala.xml._ // provided by the scala-xml module

class Parser() {
  def apply(payload: String): Either[Error, Employee] =
    for {
      xml              <- parse(payload)
      firstname        <- findIn(xml, x => (x >> "firstname").text)
      lastname         <- findIn(xml, x => (x >> "lastname").text)
      age              <- findIn(xml, x => (x >> "age").text.toInt)
    } yield Employee(firstName = firstname, lastName = lastname, age = age)

  private def parse(payload: String): Either[Error, Elem] =
     tryWith(e => s"Invalid XML: ${e.getMessage}") {
       XML.loadString(payload)
     }

  private def tryWith[T](toErrorMessage: Throwable => String)(f: => T): Either[Error, T] =
     Try { f }.toEither.fold(e => Left(Error(toErrorMessage(e))), Right(_))

  private def findIn[T](node: Node, path: Node => T): Either[Error, T] =
     tryWith(e => s"Could not find element: ${e.getMessage}") {
       path(node)
     }

  implicit class NodeOps(node: Node) {
    def >>(name: String): Node = {
      val children = node \ name
      if (children.size != 1) throw new Exception(s"Cannot find unique child '$name'")
      children.head
    }
  }
}

case class Error(reason: String)
case class Employee(firstName: String, lastName: String, age: Int)

So what’s this doing? If you are just getting started with Scala, the above might not make a lot of sense, but don’t worry.

The part we are interested in is the For Comprehension: 

    for {
      xml              <- parse(payload)
      firstname        <- findIn(xml, x => (x >> "firstname").text)
      lastname         <- findIn(xml, x => (x >> "lastname").text)
      age              <- findIn(xml, x => (x >> "age").text.toInt)
    } yield Employee(firstName = firstname, lastName = lastname, age = age)

So what’s all that crazy code doing?

Firstly, let’s summarise what this code is trying to achieve. It is taken from an API that receives XML payloads and transforms them to JSON. For reference, the expected XML format is:

<employee>
   <firstname></firstname>
   <lastname></lastname>
   <age></age>
</employee>

At any stage, this transformation could fail, normally because elements are missing from the XML. Ordinarily, in OO (Object Orientated) and scripting languages, we would handle this anticipated error by wrapping a “try catch” statement around our code. However, this is frowned upon in FP and monads are commonly used instead, but why is that?

Let’s get down and dirty into what the code is doing

Each line in the For Comprehension is a “step” in the monad (as our Monopoly player found out above). Each step attempts to do something with the XML. Firstly, xml <- parse(payload) attempts to parse the data received as XML, then each subsequent line attempts to extract the firstname, lastname or age from the parsed XML:

firstname        <- findIn(xml, x => (x >> "firstname").text)
lastname         <- findIn(xml, x => (x >> "lastname").text)
age              <- findIn(xml, x => (x >> "age").text.toInt)

So we have four separate, sequential statements we are trying to execute within this comprehension.

So what happens if it fails?

This is the beauty of a monad - if at any point this process fails, it stops: the remaining steps are skipped and, in this instance, an “Error” is returned (wrapped in the Left side of the Either), which is your “get out of jail card”. I’m sure you are now wondering what this “Error” type is, and where we define it.

The eagle-eyed among you might have spotted that the “apply” function containing this monad was declared to return either an Employee (the data we will turn into JSON) or an Error. So, two “Case Classes” were defined, Error and Employee:

case class Error(reason: String)
case class Employee(firstName: String, lastName: String, age: Int)

What’s a Case Class?

A “Case Class” is Scala’s way of defining models. I will explore them further in later blogs, but for now this definition will suffice.
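As a quick taste before those later blogs, here is a small sketch of what a Case Class gives you for free. The CaseClassDemo object and the sample values are invented for this illustration:

```scala
object CaseClassDemo {
  final case class Employee(firstName: String, lastName: String, age: Int)

  val ada = Employee("Ada", "Lovelace", 36)

  // Structural equality: two instances with the same field values are equal.
  val same: Boolean = ada == Employee("Ada", "Lovelace", 36)

  // copy returns a modified copy; the original instance is left untouched.
  val older: Employee = ada.copy(age = 37)

  // A readable toString is generated for you.
  val shown: String = ada.toString // "Employee(Ada,Lovelace,36)"
}
```

No constructor boilerplate, getters or equals method had to be written by hand, which is what makes Case Classes such a good fit for modelling data.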

So, now the accompanying (apply) method will return one of these types using an “Either”:

def apply(payload: String): Either[Error, Employee] 

What’s an Either?

An “Either” in Scala is exactly what it says it is: it holds either one type or another, as a Left or a Right. In our case that is an Error (on the Left, by convention) or an Employee (on the Right).
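Here is a minimal sketch of building and reading an Either. The checkAge and describe functions are hypothetical, made up for this illustration; only the shape of the Error Case Class is borrowed from the article's code:

```scala
object EitherBasics {
  final case class Error(reason: String) // same shape as the article's Error

  // Right carries the "happy path" value, Left carries the problem.
  def checkAge(age: Int): Either[Error, Int] =
    if (age >= 0) Right(age) else Left(Error(s"Age cannot be negative: $age"))

  // The caller has to deal with both possibilities.
  def describe(result: Either[Error, Int]): String = result match {
    case Right(age)  => s"Valid age: $age"
    case Left(error) => s"Rejected: ${error.reason}"
  }
}
```

For example, `EitherBasics.describe(EitherBasics.checkAge(30))` gives "Valid age: 30", while passing -1 gives "Rejected: Age cannot be negative: -1".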

So, we have our For Comprehension that processes some XML and, if every step succeeds, builds an Employee from the extracted values. This is what the “yield” is doing. A Case Class instance is plain data (its fields and their values), so it maps naturally onto the JSON the API returns:

yield Employee(firstName = firstname, lastName = lastname, age = age)

And, alternatively, if the monad fails it will return an instance of the Error Case Class.
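The “stop at the first failure” behaviour can be seen in a small standalone sketch. The step function and the ran buffer are hypothetical, and the mutable buffer is there purely to record which steps actually executed:

```scala
import scala.collection.mutable.ListBuffer

object ShortCircuit {
  // Records the name of every step that actually runs.
  val ran: ListBuffer[String] = ListBuffer.empty[String]

  // A pretend step that either succeeds with its own name or fails.
  def step(name: String, ok: Boolean): Either[String, String] = {
    ran += name
    if (ok) Right(name) else Left(s"$name failed")
  }

  def run(): Either[String, String] = {
    ran.clear()
    for {
      a <- step("parse", ok = true)
      b <- step("firstname", ok = false) // fails here
      c <- step("lastname", ok = true)   // never evaluated
    } yield s"$a, $b, $c"
  }
}
```

Calling `ShortCircuit.run()` returns `Left("firstname failed")`, and afterwards `ShortCircuit.ran` contains only "parse" and "firstname": the third step was skipped without a single if statement or try catch.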

As Scala is Type Safe (explained in my previous article), it will not compile if any of the steps in the For Comprehension does not return an Either with Error as its left-hand type, or if the yield does not produce an Employee. This enforces Type Safety, and ultimately makes sure everything slots together correctly. It is like putting Lego together: if a piece doesn’t fit the expected shape/mould, you cannot add it and carry on building.

lego.jpg

But still, why not just use a Try Catch?

Try Catch is purely for error handling; here we are talking about "happy and unhappy paths".

You're anticipating the outcomes and making sure every method/function in the chain conforms to the possible outcome types, rather than wrapping a try catch around your code and saying "something could go wrong, I'll catch it if it does". In this instance an Employee or an Error will be returned, hence all the Lego pieces slotting together. Rather than an Either, I could have used a Scala Try for the example (not to be mistaken for try catch).

A Try returns either a Success, which holds the result, or a Failure, which holds the exception. This in turn can be handled with a Scala “case statement” (pattern match) or, as shorthand, a Scala “map”. But I thought that would be harder to understand, as it looks too much like a try catch and I'd potentially be getting maps involved! Moreover, an Either is commonly used for catching errors, but that is not what it was originally designed for: it is literally either this or that.
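For the curious, here is a rough sketch of what the Try flavour looks like, using only the standard library. The parseAge, describe and nextBirthday functions are hypothetical names for this illustration:

```scala
import scala.util.{Failure, Success, Try}

object TryDemo {
  // Try captures any non-fatal exception thrown by toInt as a Failure.
  def parseAge(raw: String): Try[Int] = Try(raw.trim.toInt)

  // Handling both outcomes with a "case statement" (pattern match).
  def describe(raw: String): String = parseAge(raw) match {
    case Success(age) => s"Age is $age"
    case Failure(e)   => s"Could not read age: ${e.getClass.getSimpleName}"
  }

  // The "map" shorthand: transforms a Success and leaves a Failure untouched.
  def nextBirthday(raw: String): Try[Int] = parseAge(raw).map(_ + 1)
}
```

Here `TryDemo.describe("42")` gives "Age is 42", whereas `TryDemo.describe("abc")` gives "Could not read age: NumberFormatException". Note that the Failure side is always a Throwable, while the Left of an Either can be any type you like, such as our Error Case Class.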

Ok, but what does the “Error” return and how do we use it?

We have our Error returned as an instance of the Case Class, so essentially we can do whatever we want with it. The usual approach would be to log and display the error accordingly. As this is an API, the caller will have the error message returned to them, and so will know the XML sent was invalid (also, as this is an API, we would make sure this is returned within a 4xx response).
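As a sketch of that last step, this is one way the result of the parser could be turned into a response. The Response Case Class and toResponse function are hypothetical stand-ins (a real API would use the types from its web framework), and the JSON is built by hand without escaping purely to keep the example dependency-free:

```scala
object ResponseDemo {
  final case class Error(reason: String)
  final case class Employee(firstName: String, lastName: String, age: Int)

  // Hypothetical response model, assumed for this illustration only.
  final case class Response(status: Int, body: String)

  def toResponse(result: Either[Error, Employee]): Response = result match {
    // Happy path: 200 with the employee as JSON.
    case Right(e) =>
      Response(200, s"""{"firstName":"${e.firstName}","lastName":"${e.lastName}","age":${e.age}}""")
    // Unhappy path: a 4xx response carrying the reason back to the caller.
    case Left(err) =>
      Response(400, s"""{"error":"${err.reason}"}""")
  }
}
```

Because the result is an Either, the compiler nudges us to handle the unhappy path explicitly, rather than hoping somebody remembered to catch an exception further up the stack.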

The possible error messages are defined in the following methods: parse, findIn and the >> method in NodeOps, with tryWith turning any exception into an Error (see above code). These different methods have a lot of depth to them, so we’ll explore them later in this series.

Summary

So, the above implicitly demonstrates how we can use a monad to manipulate and transform data with the safety and security of knowing our exceptions will be handled accordingly. Thus, monads can prove invaluable within Data Engineering, in particular ETL processes.

Functional Programming is a huge subject, so in this series we will only cherry-pick some of the most useful tools in an FP programmer’s belt. However, if you are interested in exploring the full paradigm, I’d recommend the following books:

  • Functional Programming In Scala: Paul Chiusano, Runar Bjarnason

  • Programming Scala: Scalability = Functional Programming + Objects: Dean Wampler, Alex Payne

Thank you for reading, and let me know if you have any questions around monads & Scala!

Anna Wykes