I’m the data guy, but… which one exactly?

I’m often asked about my job by relatives, friends, or even at a bank or an administration office, and to be honest it is sometimes pretty hard to answer. Not that I don’t know what I do, but when I say I’m a BI expert or a Data Engineer, they ask: what? Then I say it is kind of IT, so they ask me to fix their computer… 🙂

Well, this is really true for those who are very far from technology, but the data domain is still a lesser-known part of the organization at companies of all sizes, even though it is growing extremely fast globally. But how should companies start a data team to become a data-driven organization? This is my view on it.

First of all, let’s look at the journey the data takes until it reaches the users as information. The data comes from different sources: applications, Excel files, databases, internet sites, etc. After it is gathered from the sources, at least a couple of transformation or enrichment steps are needed, and in the end some nice charts tell the team what they should do with their business. Yes, this is a very high-level, “layperson’s” definition of the route, but ultimately this is the birth of insights.
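To make that journey a bit more tangible, here is a minimal sketch in Python. It is only an illustration under my own assumptions: the CSV content, the column names and the function names (`extract`, `transform`, `summarize`) are all made up, and a real pipeline would read from actual applications or databases and feed a visualization tool instead of printing.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export from a business application (illustrative data only).
RAW_CSV = """order_id,region,amount
1,North,120.50
2,South,80.00
3,North,
4,South,45.25
"""


def extract(raw_text):
    """Gather: read rows from a source (here an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop rows with a missing amount and convert amounts to numbers."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append({"region": row["region"], "amount": float(row["amount"])})
    return cleaned


def summarize(rows):
    """Insight: total amount per region, the kind of number a chart would show."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


if __name__ == "__main__":
    totals = summarize(transform(extract(RAW_CSV)))
    for region, total in sorted(totals.items()):
        print(f"{region}: {total:.2f}")
```

The three functions map to the three stops of the route: gathering, transforming, and presenting something a decision maker can read.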

“We haven’t done anything with data yet”

For small companies it is definitely a big investment to set up even one data-related role. Therefore they should really consider what type of role it should be, and who can start implementing data management in the company. My recommendation for newcomers is to employ a Data Analyst. A Data Analyst sits somewhere between business and IT: someone who is able to understand the business requirements but also has solid enough tech awareness to build the solution. Let me emphasize the difference between a Data Analyst and a Business Analyst here, as the two may sound very similar, but they are not. A Data Analyst is a technical person with good business understanding, while a Business Analyst is a business person with good tech understanding. Business Analysts are also needed at some point, but if you hire a Business Analyst for the data work, you might fall into a trap and never get to the insights you wished to have.

So the Data Analyst is the person who can perform a kind of one-man show for a small company, doing some “magic” with the data and delivering the insights needed for decision making. Regarding technology, I would say that if the business applications are already there, the only thing needed at this point is a data visualization tool, such as Power BI from the Microsoft offerings. (I’ll talk about Microsoft products as I’m mostly familiar with them, but there are definitely other products on the market to choose from.) Power BI not only visualizes the data, it also has the essential capabilities to connect to a pretty large number of data sources and to do a wide range of transformations. Some licenses are needed to share the content with the team, but a Pro license was $10 per user per month at the time of writing, which is definitely affordable even for small companies. Ultimately, this is how they can start their data-driven journey.

“We have some reports already, but we would like to strengthen the data-driven mindset and improve data availability”

For mid-size companies a data team is a valid option. One person is no longer enough. These companies have multiple departments and a few hundred employees altogether. Their demand for competitive data solutions is higher than at start-ups, and it should be supported by an adequately sized team. I would say that for a mid-size company the team should consist of 2-4 professionals with the following roles. A Data Analyst / BI Developer is still essential, as this is the role that ultimately creates the insights for the business, just as above. In the downstream part some other roles might also be considered at this stage. A Data Product Owner is a person who also sits between business and IT, but this role is more about keeping a close relationship with the stakeholders, gathering their requirements, presenting the available solutions to them and driving data literacy across the entire organization. A Data Product Owner ideally knows and understands all data products of the team and acts as an evangelist to spread knowledge. At some companies, where appropriate, a Data Scientist can also be important. Data Scientists work very closely with the business. Their role is to understand the business and, by analyzing the data with statistical methods, to recognize patterns, identify problems and opportunities, and present them to the business. A Data Scientist can also apply statistical and advanced analytics techniques to enhance decision making.

Going upstream, for a mid-size company it is also essential to build and maintain a proper data collection and storage infrastructure: a Data Warehouse and data ingestion processes. For this work a Data Engineer role is necessary. Data Engineers are professionals whose responsibility is to gather data from the data sources, transform and enhance it as required, and store it in a shape that can then be easily used by the Data Analyst(s) and Data Scientist(s). So this role is a bit further from the business and is in daily contact with the Data Analysts, the Data Scientists and the source system admins. Data Engineers are pretty technical people who should have experience in a few programming and query languages, and the ability to create and maintain a well-functioning Data Warehouse. Ideally this person should also act as an architect who understands not only DWH principles but BI as well, in order to connect the two.

Another role that might already be important for mid-size companies, and the one closest to the data sources, is the Master Data Management (MDM) Specialist. An MDM Specialist is responsible for the data itself. The MDM Specialist should work closely both with the business to understand processes and with the system admins to understand the supporting applications. This role should create a framework for how data should be entered, to ensure the consistency and cleanliness of the data. The MDM Specialist should create rules and apply them to the systems, making those systems ready to be connected later on.
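As a rough idea of what such data entry rules could look like when written down, here is a small Python sketch. Everything in it is a hypothetical assumption of mine for illustration: the field names, the `C` plus five digits ID format and the allowed country codes are not taken from any real system or MDM tool.

```python
import re

# Hypothetical entry rules for a customer master record (illustrative only).
RULES = {
    # Customer IDs are assumed to look like "C00042".
    "customer_id": lambda v: bool(re.fullmatch(r"C\d{5}", v)),
    # Names must be non-empty and must not carry leading/trailing spaces.
    "name": lambda v: len(v) > 0 and v == v.strip(),
    # Only an agreed list of country codes is accepted.
    "country": lambda v: v in {"HU", "DE", "US"},
}


def validate(record):
    """Return the names of the fields that break a rule (empty list = clean)."""
    return [
        field
        for field, rule in RULES.items()
        if not rule(record.get(field, ""))
    ]


if __name__ == "__main__":
    good = {"customer_id": "C00042", "name": "Acme Ltd", "country": "DE"}
    bad = {"customer_id": "42", "name": " Acme ", "country": "XX"}
    print(validate(good))
    print(validate(bad))
```

In practice such rules would usually live in the source applications or in a dedicated MDM tool rather than in a script, but the principle is the same: the rule is defined once and every new record is checked against it.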

At mid-size companies there are already quite a few roles that could be filled; some of them are essential, others are optional. Considering the still relatively small size of the team, we should accept that team members will need to wear multiple hats and act in multiple roles. E.g. a Data Analyst can also act as a Data Product Owner, or a Data Engineer can act as a Data Analyst, etc. It really depends on the personal skills of the team members. I’ll write another post later with a more detailed explanation of how many hats we should wear.

Regarding the technologies to be used here, data visualization tools alone are no longer enough. We will need to purchase solutions for data storage and transformation. Azure provides different options: we can use structured and unstructured data storage, like Azure SQL DB, Cosmos DB or Data Lake, and tools to transform data, like Synapse Analytics, Data Factory or Databricks. Which ones to choose is really a multi-factor decision. I personally prefer Synapse Analytics, as it is kind of a one-stop shop that includes a dedicated SQL pool and all the transformation capabilities the other independent tools offer. The disadvantage of this consolidated platform is its price, which is significantly higher than that of the other tools, but I think the savings in human work can balance it. Using Data Factory, SQL Server and/or Databricks independently might be cheaper for the company, but we should definitely expect more working hours to deliver a data solution.

“We are a big company with a solid data team to support the organization”

Big companies can and definitely should afford a bigger data team to deliver scalable and performant solutions. Let’s start from downstream. Data Analysts and BI Developers, who create the reports and dashboards, are still essential. At this stage their work is more about working on specific projects. The Data Product Owner is a must here and should be an independent role. The Data Product Owner should have a close and continuous relationship with both the business and the BI team. BI should already be a team here; several people are definitely needed to deliver the amount of data products a big company requires. To lead this team, a BI Manager or BI Lead will be essential to streamline the methods and, optionally, the look of the reports, as the different developments should feel at least similar to the end users. The Data Scientist is also an essential role here, with the activities mentioned above. At a big company a BI Admin will also be needed. The BI Admin maintains the necessary BI framework by managing access, monitoring usage, recommending best practices, etc. At this stage, even if not as part of the data team, the company will definitely need Business Analysts: the tech-savvy people in the business departments who prepare material for executive decision making. Business Analysts here should also be able to create their own reports, or some basic reports for their departments. BI Developers should therefore create data models that can be reused by those Business Analysts and ensure consistency between reports. At this stage reporting practices should be taught to the community very well, and trainings and workshops should be held. Self-service reporting is an approach that treats the ability to create reports with tools as essential not only for the professional data team members but also for “citizens”, at least at a basic level.

Moving upstream, the company also needs a Data Engineering team. This team should also have a Lead or Manager who coordinates the work of the Data Engineers. A Data Solutions Architect is a must here. This role is responsible for designing the data storage, transformation and data movement framework, including selecting the proper tools and creating the data solutions architecture that defines how those tools and processes will connect to each other. At big companies there are not only reporting needs but also the need to connect many different systems in the company, like ERP, CRM and other satellite systems, Excel files, etc. So the Data Engineering team can also be split into two streams: one maintaining the DWH for analytics solutions, the other maintaining the data movements between systems. The first stream’s activities were already covered in the previous stage, but system integration is a bit different and requires somewhat different approaches and tools. A possible third stream might be ML Engineering, whose members work specifically on Machine Learning solutions. This role should have more in-depth knowledge of statistics and math in order to train reliable models for predictions, classifications, etc. As the DWH at big companies is also big, a DB Admin would be important as well. Database Admins maintain the DWH, including access, performance and costs. They can work closely with the Data Engineers to keep the company’s DWH in the best shape.

Master Data Management is even more important for big companies. Some companies have smaller MDM functions whose responsibility is to design data frameworks and to monitor data cleanliness, while others set up bigger Master Data teams that are specifically responsible for the full master data management. In that case, for example, nobody from the business can set up new products or customers; only the master data team can, and the business has to request the creation of new entities through prepared forms and processes.

From a technology perspective the above-mentioned DWH, BI and ETL tools are definitely essential. Big companies should already consider larger licenses/subscriptions like Power BI Premium. Another tool that becomes essential is one for security and data governance. Microsoft has Purview, an end-to-end data governance and cataloging tool which can cover most of the needs here. ML Studio or Databricks might already be necessary for Machine Learning activities. At this stage auditing is also necessary, for which Log Analytics can be used. No big-company data activity can be imagined without CI/CD pipelines and source control, and there are different tools that can be used for this. For the BI stream, the Power BI Service itself provides Dev/Test/Prod deployment pipelines, and Power BI also recently introduced the Project format, which can handle our data models as semantic models and connect them to the Service and to Azure DevOps or Git. These last two should also be used for the DWH and data transformation processes, so we can maintain independent developments with reviews and proper backup capabilities.

At big companies there can already be more than a dozen different data roles, all somewhat different from each other. Gartner published a pretty solid explanation of data roles back in 2022, which I recommend if you would like to learn more: https://www.gartner.com/en/doc/what-are-the-essential-roles-for-data-and-analytics

So this is my approach 🙂 I hope I could help you better understand how data teams should look at the different stages.
