Blog — Advancing Analytics
Data Science | AI | DataOps | Engineering

Blog

Data Science & Data Engineering blogs

Testing Data

This blog post covers testing data within dbt, starting with the easiest aspect: verifying the reliability of the query applied to the data. The author discusses why it is important to define exactly what is being tested, and how that choice affects the validity of the data. The other testing questions, which are more aligned with integration and regression testing, are saved for another time.

Image Classification — Dealing with Imbalance in Datasets

Image classification is a standard computer vision task that involves training a model to assign a label to a given image, for example a model that classifies images of different root vegetables. A big problem in classification is bias, where the model favours one image class over the others. A common cause is dataset imbalance, which is often hard to spot because a model trained on an imbalanced dataset can still report good accuracy. For example, if the test dataset contains 1000 images, 950 potatoes and 50 carrots, and the model predicts all 1000 images to be potatoes, it still scores 95% accuracy. This is also an example of why more metrics than accuracy should be considered… but let’s leave that discussion for another day.
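The arithmetic in the potato and carrot example can be checked with a few lines of plain Python. This is only a minimal sketch: the label strings, the helper functions `accuracy` and `recall`, and the choice of per-class recall as the second metric are assumptions made for illustration, not something taken from the post itself.

```python
# Hypothetical test set from the example above: 950 potatoes, 50 carrots.
y_true = ["potato"] * 950 + ["carrot"] * 50

# A degenerate "model" that predicts potato for every image.
y_pred = ["potato"] * len(y_true)


def accuracy(true_labels, predicted_labels):
    """Fraction of all images that were labelled correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def recall(true_labels, predicted_labels, label):
    """Fraction of images that truly belong to `label` and were predicted as `label`."""
    pairs = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == label]
    hits = sum(t == p for t, p in pairs)
    return hits / len(pairs)


acc = accuracy(y_true, y_pred)
potato_recall = recall(y_true, y_pred, "potato")
carrot_recall = recall(y_true, y_pred, "carrot")

print(f"accuracy:      {acc:.2f}")
print(f"potato recall: {potato_recall:.2f}")
print(f"carrot recall: {carrot_recall:.2f}")
```

Overall accuracy looks healthy, yet not a single carrot is found, which is the kind of failure that accuracy alone hides on an imbalanced dataset.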

Alexander Billington