© 2020 Data Blueprint All Rights Reserved.

Does your Data Scientist hate your data?

"This role isn’t the right fit for you"


Those were the words my manager delivered right after he asked, “How was your weekend?” His lack of emotional intelligence and professional tact aside, the time had come for me to leave my position. This came after eight months of working in three separate relational database systems, generating visualizations from cobbled-together data sets, and writing scripts that attempted to fix underlying issues in the company’s data.

As I reflected on my experiences at the company, it would have been easy to let my confidence be shaken. I could have wrongly accepted that the problem lay not in the data structures but in my alleged ‘lack of attention to detail’. After we agreed that I was not set up for success, I left that day proud of what I had accomplished, but also disappointed about what I had not. I had been excited to build many data products that would have helped the business, but I was unable to do so because the company lacked a well-modeled data warehouse. Sound familiar?


Data Warehouse? Can’t I just hire a data scientist?


The Data Scientist, the sexiest job of the 21st century…if I had a nickel. To be clear, my role at the company was an analyst role, and not a data scientist role. However, I did have opportunities to work on advanced analysis and model what was known to be fraudulent activity. This work was largely complicated by the need to script queries and joins across multiple databases, then clean and label the data, before finally moving on to actually building a model. The initial results from the model didn’t make sense, and researching those issues helped identify more complications in the data set construction. This is where a data warehouse becomes useful.
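
To make that pre-modeling overhead concrete, here is a minimal sketch of the kind of script it involved. Everything in it is hypothetical: the two in-memory SQLite databases stand in for separate source systems, and the table names, column names, and the `normalize_id` and `build_training_rows` helpers are invented for illustration. None of it is the actual code or data from that job.

```python
import sqlite3

# Two hypothetical source systems, each with its own idea of an account ID.
accounts_db = sqlite3.connect(":memory:")
accounts_db.execute("CREATE TABLE accounts (account_id TEXT, owner TEXT)")
accounts_db.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("A-001", "Pat"), ("A-002", "Sam")],
)

payments_db = sqlite3.connect(":memory:")
payments_db.execute(
    "CREATE TABLE payments (acct TEXT, amount REAL, flagged_fraud INTEGER)"
)
payments_db.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(" a-001", 25.0, 0), ("A-002 ", 900.0, 1), ("A-003", 10.0, 0)],
)


def normalize_id(raw):
    """Clean step: trim whitespace and unify case so the keys can match."""
    return raw.strip().upper()


def build_training_rows(accounts_conn, payments_conn):
    """Join across the two databases by hand, then label each row."""
    owners = {
        normalize_id(account_id): owner
        for account_id, owner in accounts_conn.execute(
            "SELECT account_id, owner FROM accounts"
        )
    }
    rows, unmatched = [], []
    for acct, amount, flagged in payments_conn.execute(
        "SELECT acct, amount, flagged_fraud FROM payments ORDER BY rowid"
    ):
        key = normalize_id(acct)
        if key not in owners:
            # A payment with no matching account: a data issue, not a model issue.
            unmatched.append(key)
            continue
        rows.append(
            {
                "account_id": key,
                "owner": owners[key],
                "amount": amount,
                "is_fraud": bool(flagged),
            }
        )
    return rows, unmatched


rows, unmatched = build_training_rows(accounts_db, payments_db)
```

Every line above is plumbing that has to exist before any modeling starts, and the `unmatched` list is where the surprises tend to show up.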


Why you need a data warehouse


Can your $150,000-a-year resource script in Python to clean your data? Sure. Can you crawl under your car on a hot Saturday morning and change your oil? Probably. However, neither scenario is the best use of anyone’s time. Your Data Scientist should be focused on building models, tuning hyperparameters, and optimizing their outputs. Spending hours cleaning and assembling a dataset for model input is not their job. Ideally, your Data Scientists should be able to access specific information marts that relate to the business problem they are trying to solve.


Types of Data Warehouse


Unfortunately, a data warehouse is not a one-size-fits-all solution. You can’t sign in to your AWS or Azure portal, click “create data warehouse” and get going, though maybe that should be a feature request. The flip side of that coin is that it doesn’t have to take years of requirements gathering and analysis to construct a data warehouse. Data warehouses gained prominence in 1992 when Bill Inmon published Building the Data Warehouse. His was a top-down approach that aimed to stage data from disparate sources, load it into the data warehouse, and pass it on to end users via data marts. This approach should feel familiar to those experienced with databases, as it is a fully normalized structure.



In 1996 Ralph Kimball published The Data Warehouse Toolkit, which describes a dimensional approach to data warehousing. In the dimensional approach, you are concerned with facts and dimensions, built in a star schema, and are NOT worried about normalizing the data.
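
As a rough illustration of facts and dimensions, the sketch below builds a tiny star schema in an in-memory SQLite database. The table names (`fact_sales`, `dim_customer`, `dim_date`), the columns, and the sample rows are all made up for this example; a real dimensional model would be driven by your own business processes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table in the middle, dimensions around it.
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT,
        region        TEXT      -- kept on the customer row, not normalized out
    );
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        year     INTEGER,
        month    INTEGER
    );
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer (customer_key),
        date_key     INTEGER REFERENCES dim_date (date_key),
        amount       REAL
    );
""")

conn.executemany(
    "INSERT INTO dim_customer VALUES (?, ?, ?)",
    [(1, "Acme", "East"), (2, "Globex", "West")],
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?)",
    [(20200115, 2020, 1), (20200210, 2020, 2)],
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(1, 20200115, 100.0), (1, 20200210, 50.0), (2, 20200115, 75.0)],
)


def sales_by_region_and_month(connection):
    """Aggregate the fact table, sliced by attributes from two dimensions."""
    return connection.execute("""
        SELECT c.region, d.month, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_customer AS c ON c.customer_key = f.customer_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY c.region, d.month
        ORDER BY c.region, d.month
    """).fetchall()


result = sales_by_region_and_month(conn)
```

The query shape is the point: every question is the fact table joined one hop out to whichever dimensions you want to slice by, which is why the layout is called a star.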




The Inmon and Kimball approaches are both very functional for enterprises, which is why they have been dominant forces in the market. What limits those methodologies is that you need to plan out your entire enterprise in advance. The approaches are not designed to be flexible to changes in needs, requirements, and lines of business. To make up for that, Dan Linstedt created the Data Vault framework and released it in 2000. The data vault allows the enterprise to warehouse specific units or functions of the business one at a time. Often, this is done in accordance with Agile practices, which makes the methodology even more desirable.
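
The "one function at a time" idea can be sketched as follows. This example borrows the hub, link, and satellite table types from general Data Vault vocabulary, which this article does not otherwise cover, and every table and column name here is an assumption made for illustration. The thing to notice is that the second increment only adds tables; nothing from the first increment is altered.

```python
import sqlite3


def table_names(connection):
    """Return the names of all tables currently in the database, sorted."""
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in rows]


conn = sqlite3.connect(":memory:")

# Increment 1 (hypothetical): warehouse only the customer function.
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_hk   TEXT PRIMARY KEY,  -- surrogate for the business key
        customer_id   TEXT NOT NULL,     -- business key from the source system
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    CREATE TABLE sat_customer_details (
        customer_hk TEXT NOT NULL REFERENCES hub_customer (customer_hk),
        load_date   TEXT NOT NULL,
        name        TEXT,
        PRIMARY KEY (customer_hk, load_date)
    );
""")
tables_after_first = table_names(conn)

# Increment 2 (hypothetical): a later sprint adds orders with CREATE only,
# leaving the customer tables exactly as they were.
conn.executescript("""
    CREATE TABLE hub_order (
        order_hk      TEXT PRIMARY KEY,
        order_id      TEXT NOT NULL,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    CREATE TABLE link_customer_order (
        customer_hk TEXT NOT NULL REFERENCES hub_customer (customer_hk),
        order_hk    TEXT NOT NULL REFERENCES hub_order (order_hk),
        load_date   TEXT NOT NULL,
        PRIMARY KEY (customer_hk, order_hk)
    );
""")
tables_after_second = table_names(conn)
```

Because new business functions attach through new link tables instead of changes to existing ones, each increment can ship on its own, which is what makes the approach a comfortable fit for Agile delivery.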




No matter which methodology you decide to use, there are multiple ways to get to a desirable outcome. If you aren’t sure where to begin gathering requirements, which DW is right for you, or how to best serve data to your Advanced Analysts, contact Data Blueprint today.