Loading data into Azure data lake

Webinar | Loading Data into Azure Data Lake Gen2 with Azure Data Factory v2

Loading data into Azure data lake
Share on facebook
Share on twitter
Share on linkedin
Share on email

How do you architect and load a modern data warehouse using Azure Data Lake Gen2 and Azure Data Factory v2? In this webinar, our data analytics practice lead, Jose Chinchilla, will show you how to easily load data into Azure Data Lake Gen2 with Azure Data Factory v2. You’ll learn the benefits of both Azure Data Lake Gen2 and Azure Data Factory v2—like faster performance and cost-effective storage—and how they can help you save time building big data analytics solutions.

Learn how to:

  • Author and monitor data pipelines with Azure Data Factory v2
  • Optimize the design of a big data analytics solution with Azure Data Lake Gen2
  • Accelerate ETL/ELT data load patterns into Azure Data Lake Gen2 using Azure Data Factory v2

Transcript for webinar:

[00:00:05] Madison Hi everyone and thank you for joining us. My name is Madison Earhart and I’m on the marketing team here. AgileThought just a little bit about us. We’re a custom software development and consulting company that works primarily with Fortune 1000 clients. We help enterprise companies and vision deliver and maintain software solutions that are critical to their business. We were also recently named a finalist for the 2019 Microsoft A.I. and machine learning partner of the Year award. And we’re really excited about this recognition. So today I’m joined by our presenter who’s a chinchilla and he is our data analytics practice lead here at AgileThought. So during this webinar who’s a who’s going to teach you the benefits of Azure Data Lake Gentoo and as your Data Factory V2 as well as how to save time building a big data analytics solutions. So before we begin. Just a few housekeeping items were randomly going to select three lucky winners to receive a one hour consultation with hose. So the winners can use this consultation as an opportunity to discuss their data migration project to Azure. They can also discuss different migration strategies with ETF workloads to Azure and discuss data lake implementation and we’re going to email the winners after the webinar with next steps to get this cancelled scheduled. So now I’m going to hand everything off to you.

[00:01:33] Jose All right. Thanks for an introduction Madison. So like Madison mentioned I lead the nodes theme here and AgileThought.

[00:01:41] Jose And I’ve written a couple of books and have earned several certifications including the FMC ACM GSA and the latest one micro certified trainer. So I’m happy you can join us today.

[00:01:53] Jose So today we’re going to be short and sweet. We’re going to talk about some of the data leg design considerations for your organization. We also want to talk about the store generation to ensure that a factory version too. And just to recap some of the high level benefits and capabilities of both services and then we’re going to do a quick demo on how to assure that the store generation to we actually did in fact through aversion to using the new data flows which is still in preview for us fidelity. So why data leak. Historically a organizations collect lots of data for their day to day operations including sales transactions CRM financial information and most of these data has been historically structured data. When I’m in structured it’s typically in some type of relational single database as well as some other sources web services application and so forth that most likely have some type of structure schemas behind that application. In recent years there has been a influx of lots of different formats and variety of data including social media data video images sound in text documents that are not easy to structure in a relational type of database. So combining all these sums to structure on an unstructured data along with the volume of data. This is why there has been a shift in the way we structure our analytical stores typically look at their house and moved into moving to a data lake type of analytical storage.

[00:04:02] Jose Now with that said now we can enable our data analyst to connect to all sorts of data through a common story.

[00:04:12] Jose You don’t have to go some different places different connections to create to be able to aggregate all these data analysis and insights from it. Differentiate between a debtor has Veronique These are the three important points and comparison points to keep in mind the debtor house typically has been created with a schema already defined already a structure in order he has data that’s been processed. Typically you go through an ETF process where the data gets extracted from the data source information gets applied and then they are written into a pretty fine data schema. A boat as opposed to a data lake which accepts any type of data the structure and structure and doesn’t have a predefined right schema. And typically when applications connect to a debtor league they will connect and define a schema. That time of three these are defined then commonly known as a schema on read. In addition to that in order to scale in how high availability this recovery that our house. You need to have a plan in place. You need to be able to forecast the growth that our house the number of users in the performance and your excellence in plan. All that out beforehand with data leak and specifically with Azure daintily you have an limited up scalability. Essentially you can grow into petabytes or even hex a vital data as well as support a high number of users in the performance of the throughput. We’ll also scale linearly along with the storage. In addition to that you have already old high availability disaster recovery building from Azure services from locally redundant redundant storage to global probably redundant storage that you can synchronize your data to other data centers around the world. One of the big with his region comes to a data lake is the architecture you’re gonna put in place along with the file format content physical design security and the frequency of the data you’re going to be loading. So these are the four more input more important the same consideration that I have gathered from all the data implementations we’ve done specifically in architecture we’re talking about consolidating all these data into a single data link or are we creating multiple data. So sometimes referred to data points where you have small data March small sections of your data for specific TV shows or business units in addition to that as part of the architecture and to consider what data that needs to be available almost immediately or in real time versus data that can undergo some transformations. Some staging some processing in it not be as real time. So that data in in the net architecture you implement what are known as hot or customs hot for being that more real time more recent data and call stream for that data that undergoes some type of processing that is not made available immediately to the users. And when I talk immediately or a real time when talking about under five minutes under ten minutes that depends on the organization. In addition to that we talk about a file format and content is the content of these data going to be in a raw stage in a process or similar process stage as well as what is the format of the files. Typically most organizations choose a common separate body or text limit file format by their more advanced type of format that are geared to faster performance such as parquet Agora or some Jason Hummel binary type of files. So that depends on the use case in the tools are you going to be using to connect to these data. Was physical design one of the considerations here is how are you going to organize your data. Now we did a store generation to that we will find that in a little bit. You can actually define a file system with a hierarchy defined so folders some folders and files and then you have to also consider the security aspect. How are you going to secure the data you’re going to open it up the entire data link to the entire organization. Or are you going to be very prescriptive on who has access to what different different type of folders you’re you can achieve these now with roles or access control list. Just like a regular file system and then the frequency like I mentioned is these data are going to be loaded kind of in a real time type of scenario or use case or a loading this data in an hourly daily type of fashion. So all these considerations have come into play when you architect your data like from how meta that data is going to be available how you going to write that data to how we’re going to secure it in. How often even loaded. In doesn’t if I should that storage and storage generation to there are some new features and capabilities that have been added in part of these capabilities is a combination of blob storage along with eight units or actually that store generation one. So Microsoft brought the best of both worlds together into what’s now called for storage generation. And one of the biggest improvements here is that now in you support or Microsoft supports the ability to let a hierarchical namespace essentially a file system with directory sub directories which allows you to create define very granular security both at a service level or the filesystem directory level down to the file level. If you want to be very granular and all of these important ways roll based off of vacation access control list and a sampling so percent of postings permissions. In addition to that part of these hybrid best of two worlds of love stories and less is the ability to peer your storage both in capacity in terms of how much data you’re going to store. Like I mentioned we can scale up to exabytes but also the throughput that goes along with it and the ability just recovering. So all of this is built in and supported in actual garlic storage generation to. So now why is it important to talk about factory we talking about loading a deadly so factory. As you may already know it’s a data integration EDL or EEOC type of web service where you can run a schedule run and monitor pipeline expansion. And currently there are still versions version 1 which is the first iteration you have a reach user interface to develop pipelines. Now we’ve actually affects reversion to which is the current version. You have a reach web based development environment with a very rich and intuitive user interface similar to SSIS in them in the way you drag and drop activities or transformations into your workspace and then connect them to create a control flow or data for. Which took it about data flows which is still not previewed. This is the ability to create a very logical step by step transformations on your data. I have very similar to SSI is. In our factory uses as you get a bricks as its compute power. So it is very scalable and transformations to massive amounts of data and actual data breach in the back end will paralyze and scale these for you. One of the great additions you do. I should get a factory capabilities is the ability to not only connect to cloud or Web based data sources but also on premise data sources using what’s called the cell phones run. And if you already have a lot of work load a lot of work and effort into SSIS packages you can actually deploy these packages to actually get a factory work to aim execute them monitor them just like you will. From the article it is kind of. More specifically talking about some of the features of isolated effects reversion to his idea of a developing workable pipeline is similar to SSIS. Like packages and pipelines contains one or more activities it can contain a specialized activity which is previous Kearney called data flows and either license data sets as a source or a destination C to load extract and then load the data into and this data sets are defined by length services which are like connection managers that allow you to define things like what type of sources sources to many connecting to credentials and then be able to define datasets from those linked services. And then at the end we also have the integration context which we talk. These are specialized compute systems that allow you to either do some work over a data source or kick off some jobs in other specialized systems such as machine learning data breaks spark etc.. So with that we’re going to jump into our demo and we’re going to load a ventilator using national data factory aversion to. So what I have here is I have a sequel server data source which I have my sales order line sales order. And what I’m trying to do for my sales team is to create summary files everyday about who we sold our products to.

[00:15:21] Jose What products. When did we sell them. And the amount of orders.

[00:15:28] Jose So for that I’m going to extract the data from these SQL database and I’m going to load these assured data storage. This is a lazy generation to type a storage account. And I have a file system here where I know data from my final scheme and then I break down the data into different folders such as better financial forecasting planning files marketing team as well. We have some folders defined for them and then for our sales team. Several folders specifically the sales summary folder which is empty at this point. So I will not create a pipeline answer to the factory that would extract a summary of the data from our relational database and create a file system here that our sales people can navigate through and be able to get that data. So that’s it. When I open my Usher portal in my Azure portal as you can see you have a Data Factory and this is a Data Factory version too. Once I click on i that affects regression to I didn’t click on my authority monitor shortcut which will open my authoring and monitoring environment. I sure did a factory. So I’m going to go into my author section and I can see several artifacts that I have here in my I share that a factory. I have a bike lanes I have my datasets. Data flows in templates that I can create specifically. Today I’m going to talk about a pipeline that I created called the self summary with data flows. And this pipeline is a very simple pipeline that allows me to define transformations on my data and be able to aggregate them look up data and then load my I should their factory. So what I have here are very high level when I describe the high level what we have in these data flow and then we’re going to kick off this pipeline and wait for the data to be created. And then when I go into a little bit into the details what is slowing. So the first thing I have here is two data sets. Sales Orders sales orders that I’m joining you can see that the John here is a INNER JOIN on my order these. Then I have my aggregate what I’m grouping data in aggregating then I looking up the customer names look customer product names and then I’m driving the following that’s going to be generated in loading it into my Azure derelict store folder. In this case the data is going to be loaded into this file system called a demo. If it’s under my sales sale summary folder and what you’ll see here is a bunch of directories that I’m gonna explain in while it’s loading. So at this point and when I kick off this pipeline I click in the debug button. As soon as I click the debug button I can see that my pipeline is executing is telling me what steps it’s running. So right now it’s running my audit log start process which recourse when this pipeline got executed and also some metrics on how long it will take. Once I log my the end of my execution. So all this is running I’m going to go back into my data flow and this is a very simple pipeline. Like I mentioned joining these two data sets together and then doing an aggregate in aggregate is essentially like doing a group like this are my group my columns. So and then I have my aggregates which is a very simple mathematical expression using the expression language in ASH that I fired through. There’s a difference in text and it is a yes or syntax in other languages that you’ve seen but it’s a very simple language syntax. In this case the multiplying quantity times unit pranks from my sales order late and then summing that data up by my group by Carlos. In this case my stock in my product I.D. sales order and order it now that I have that I want to look up my customer money in in here I am referencing a customer data set as mine look upstream which is done here think as you can see yes I hover is it highlights both the actual data and the reference to that is it similar to product name. Sorry I’m using my product dataset. And is being reference might look up to look up my stock product name at the end. What I’m doing is I’m creating a drive column and that there are economists called filing and what that funding is. If you look at the expression although it seems a little bit complicated we can see very quick the output of that expression and that expression is essentially deriving a the hierarchy that I’m going to be writing to. And the actual filing there’s going to write to the actual derelict store. So we’re gonna wait for that to refresh and we should see the example of the. File names. There’s gonna be created. So Well that’s refreshing because probably is competing with the actual pigment execution running in the background. Let’s look at our I sure did. I sure a lake storm and refresh the data here. We see that and we’ve already already created the folder structure that I’m looking for. And for example we can navigate through 2016 and it also created folders by date and then we can see in one of these folders the sales order or sales order line summaries created for each of these data folders. This is simple data coming from white all importers database so there’s no transaction for every day of the day. But for those days you can see that there’s a file created. Now we can actually download this file and we can see the contents of the file. We should be able to see the. Data in our files that we generated so that’s does as we file that we download it and we can see how we can consume that data really easily so we can see the output. You can see the order rate order the customer any sales amounts to me. And then if you see the data all these orders were orders for May 2nd 2016. They are all stored in the May 2nd 2016 file inside my 2016 year. I could create a more complex hierarchy here so I can do year and I could get lost or quarters and then break those down by day. I could also just create one month and have a daily file under that month. So this is very flexible and you can programmatically do this using the expression language as I was showing you here. So while our pipeline execution completed we can see the output of this. We cannot open is really quick in No.

[00:23:23] Jose And we can see that for these different order dates. This is the file name we created. And we can confirm that that was the actual file names that were created as well hierarchy that we define.

[00:23:41] Jose So that is a very weak example of how we can load a debt over a house having all the data. Logical transformations in our beautiful and then dynamic generating filing that we can write our data and it will follow that hierarchy based on this syntax in our expression language and then loading too. I sure did Alex store. There’s other options that we can have for loading it. I sure did I like the one I define is the file name option was generated as part of data in a column from my data and that is from that following variable or feel that I generated in my right column. But there’s other options here where you can have your data breaks which is the compute power behind the scenes generate your file names based on that. The default settings for US data roots based sort a pattern or partition or just consolidate everything into a single file. That is if you want to hand this off perhaps to a machine learning and data science thing you want to just give the one file with all the data consolidated so they don’t have to traverse multiple files but in most cases data stores will be defined like this where you have some type of subject area or deviation or bucket and then subsequent sub directories or folders for more specialized areas for analysis and then typically you would have a Year Month date year quarter month data center as your file system hierarchy.

[00:25:35] Jose So that’s the very simple execution of a shift should a factory implement to load a should elect store and then concludes the demo for today. So we’re going to go back to our questions and answers when I think Madison has also some insight for those who want a one four console.

[00:25:58] Madison o we actually received a few questions in advance from you all so we can try and get through as many as we can. And if we don’t get to your question please again email us at Marketing@AgileThought.com. The first question we have is what incremental loading and partitioning strategy do you recommend with Azure Data Factory and data lake.

[00:26:25] Jose Great question. So in terms of incremental loading one of the big topics is detecting changes from your source systems several sources items for example sea floor or other relational databases have built mechanisms to track changes over time in your data. For example SQL Server has changed tracking or changed that a capture as well as Oracle and other vendors platforms. And in addition to that it will make use of those change tracking change and capture mechanisms. You build your own incremental loading logic that could be based on a transaction date or some data columns. For example if your story the time that that record was insured it was a tape and you can make use of those meta data columns in your data sources to derive your own incremental loading and change tracking logic. Other systems don’t have that chain tracking mechanisms or those methods to close available. So the only option there will be to stage it and then do comparisons against your existing data and your data lake doing things like cache columns that will identify there’s a change in a particular column or row data. Other mechanisms that you’d actually have to track changes or identify changes you’d only append data has changed kind of looking at journal type of scenario. So there’s several options there but you have to have some type of incremental loading along with chump change striking change in a capture mechanism.

[00:28:22] Madison Okay great. Thanks. And moving onto our next question does Azure Data Lake support file formats other than plain text CSP.

[00:28:34] Jose Yes. So in the demo I show what I see as V text files with comments are pretty valid. You can have other dilemmas as well spaces or type but files you can have more specialized type of file for months like I showed in mine. It’s like that usually one of the most common one is for an average. But there’s other like or I’m just playing Jason files and files that are typically used but all that only basically use case in the tools that you have available to connect to your daily net. All tools support parquet or Alvaro type of files out of the box. So that’s a big design consideration. Looking at your toolset and the capabilities of those applications or tools. So for fast it’s I would recommend work and but you have to have some typos dispensation and process where either you stage the data in text file for your users or some type of interface that allows you to read those files into a more readable format for some what tool they users are utilizing.

[00:30:01] Madison Great. Thanks for answering those. That concludes today’s webinar. Thank you all for attending and we were going to send you this webinar recording in case you’d like to share it with your co-workers or your friends. And we will also be following up with the three winners of our consultation with Jose. And if you have any questions please send them our way. And we all hope you have a great rest of your afternoon. Thank you everyone. Thank you.

Stay Up-To-Date