Our Data and AI Team have authored several articles discussing Azure Synapse, and there's no doubting that they think it's great! One of the main themes they have discussed is that Synapse is an environment which brings together different technologies and services into a single place for processing data and gaining insights from it.
But it's worth diving into a little more detail. Our Director of Professional Services, John Gamble, explores the technologies and services that form part of a modern data platform, and what the minimum is that you need to get going.
‘In terms of best practice for a modern data platform we would expect the following to be in place:
1. Data Export and Orchestration Mechanisms
This is the ability to extract data from source platforms. Modern organisations consume many IT services to carry out their day-to-day operations. Many will have some combination of General Ledger, ERP, CRM, PSA, and other core Line of Business applications depending on what type of industry sector you are in.
Some of it will be on-premises and some will be in the Cloud using IaaS or PaaS services. Increasingly, specialist software for HR, marketing, accounting, training, and other functions has also emerged on a subscription basis using SaaS models. The modern enterprise is complex.
Despite this complexity, there will be ways to get data from all these platforms. Some of it will be through extracts, some through APIs, some by connecting directly to underlying databases.
You will need some Data Export Mechanism to connect to the application/service, authenticate, then extract the data. These mechanisms will work on either a scheduled “pull” or source-system based “push” approach.
Example technologies include Azure Data Factory, Azure Databricks, SQL Server Integration Services, and Power BI: Power Query / Data Flows.
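The scheduled "pull" approach above usually hinges on a watermark: each run asks the source system only for records modified since the last successful extract, then advances the watermark. A minimal sketch of that logic, using hypothetical entity and field names (`invoices`, `modified_at`) rather than any real API:

```python
def build_pull_request(base_url: str, entity: str, watermark: str) -> dict:
    """Describe an incremental 'pull': request only rows modified
    since the last successful run (the watermark)."""
    return {
        "url": f"{base_url}/{entity}",
        "params": {"modified_since": watermark, "page_size": 500},
    }

def advance_watermark(rows: list[dict], previous: str) -> str:
    """Once a batch has landed, move the watermark forward to the
    newest modification timestamp seen; keep it unchanged if the
    batch was empty."""
    if not rows:
        return previous
    return max(row["modified_at"] for row in rows)
```

In a real pipeline the watermark would be persisted (e.g., in a control table) between runs, and the orchestration tool would call logic like this on a schedule.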
2. Data Staging Area
This is an area to store, stage and process data prior to it being ingested into data models for consumption. In modern cloud hosted data platforms this is often called a Data Lake.
Data in the Lake is typically landed in its raw format, then processing is undertaken to cleanse and prepare it for ingestion into data models, passing through several stages:
Bronze – Raw, uncleansed
Silver – Cleansed (e.g., data types defined, data quality checked, common issues resolved)
Gold – Fully curated, standardised, and ready for consumption by data modelling or reporting tools.
Processing is performed either on a batch (scheduled) basis, with routines picking up whatever is new, or on streams, which can handle real-time processing.
Depending on your requirements a staging area could also be a separate SQL database environment, but really it should be able to handle both structured (e.g., data extracts) and unstructured data (e.g., documents).
Reporting can be built on data direct from the Lake, or the data can be used as a source for further processing and to be fed into more structured data environments such as data warehouses or data marts.
Often the orchestration tools described in point 1 will be used here to process data.
Example technologies include Azure Data Lake, SQL Server, Power BI: Power Query, and Power BI Data Flows (which uses Azure Data Lake Storage Gen2 behind the scenes).
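The bronze–silver–gold progression can be illustrated in a few lines. This is a minimal sketch, assuming hypothetical sales records with `order_id`, `amount`, and `region` fields; a real pipeline would log or quarantine rejected rows rather than silently dropping them:

```python
def to_silver(bronze_rows: list[dict]) -> list[dict]:
    """Bronze -> silver: enforce data types, trim and standardise
    values, and drop rows that fail basic quality checks."""
    silver = []
    for row in bronze_rows:
        try:
            silver.append({
                "order_id": str(row["order_id"]).strip(),
                "amount": float(row["amount"]),
                "region": str(row["region"]).strip().upper(),
            })
        except (KeyError, TypeError, ValueError):
            continue  # a real pipeline would quarantine/log this row
    return silver

def to_gold(silver_rows: list[dict]) -> dict:
    """Silver -> gold: curate into a consumption-ready summary,
    here total sales per region."""
    totals: dict[str, float] = {}
    for row in silver_rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals
```

The same shape applies whether the processing runs in Databricks notebooks, Data Factory data flows, or Power Query: land raw, cleanse into typed and checked data, then curate for consumption.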
3. Data Modelling Area
This is where the data from the Staging Area is restructured into the Data Models or Data Sets that business users consume. This is done using SQL models, analytical "cubes", or other styles of model as requirements dictate.
The key thing to understand here is that the data held in the models has already been through the staging and cleansing stages, and so is ready to be added to the existing data models. If you were building a data warehouse or data mart, the star schema models would sit here.
Example technologies include Azure SQL Database, Power BI's data modelling component, and Azure Analysis Services.
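To make the star schema idea concrete, here is a minimal sketch that splits cleansed rows into a dimension and a fact table linked by a surrogate key. The `customer` and `amount` fields are hypothetical; in practice this shaping would be done in SQL or a modelling tool rather than hand-rolled:

```python
def build_star(silver_rows: list[dict]) -> tuple[dict, list[dict]]:
    """Split cleansed rows into a customer dimension and a sales fact
    table, joined on a surrogate key -- the classic star-schema shape."""
    dim_customer: dict[str, int] = {}   # natural key -> surrogate key
    fact_sales: list[dict] = []
    for row in silver_rows:
        key = row["customer"]
        if key not in dim_customer:
            dim_customer[key] = len(dim_customer) + 1  # assign next surrogate key
        fact_sales.append({
            "customer_sk": dim_customer[key],
            "amount": row["amount"],
        })
    return dim_customer, fact_sales
```

The point of the shape, rather than this particular code, is what matters: facts hold the numbers, dimensions hold the descriptions, and reporting tools join them predictably.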
4. Data Analytics/Reporting Area
This is the reporting/analytics suite of technologies. Users can either consume pre-built dashboards or reports; or build their own reports using curated, pre-built, consistent data sets.
There are many tools that can connect here to the data models, or indeed to curated (gold) Lake data, and the most appropriate tool will be determined by your requirements. Regardless of vendor, they will all be equally wonderful or terrible based on how well you stage and model your data.
Example technologies include Power BI, Tableau, Alteryx.
So how does this relate to Azure Synapse?
Quite simply, Azure Synapse brings all these technologies and a few more (e.g., Notebooks, C#, PySpark) together into one environment for Data Engineers to work in.
Finally, what's the minimum you need to get going?
It is true that many modern analytic tools can do much of this by themselves. Tools such as Power BI, Tableau, Alteryx, etc. can all connect to different data sources and to a greater or lesser degree ingest, process and present information to end users in a manner they can work with.
Whilst this is great, and depending on your requirements may in some cases be enough, I would caution that the complexity of the modern enterprise and its source systems means you may need to consider components like the above to get a holistic and standard view of your organisation.
Using analytic tools by themselves can create “shadow IT”, where key business processes and data flows run on desktop machines, have key-person dependencies, and are not properly supported by IT departments.
This presents a key business risk. It should also be noted that Data Engineering is a recognised software discipline, and that automation of data loading routines, along with standardisation and well-designed dashboards using business terminology, remains key for end user adoption.
Be mindful of the tools that promise to do it all: there's no magic bullet, despite what the vendors and salespeople tell you. Please don't run core business processing on desktop software. And regardless of how good the end user tool is, managing your data and conforming it into well-designed models is the key to making it really sing.'