Excel is still widely used across all industries by people who maintain their data in spreadsheets. That data often needs to be joined with application-level data held in relational databases to produce various reports.
These Excel files may arrive through various channels: email, SharePoint uploads, FTP file shares and so on.
I will not address the downloading part here; this discussion is solely about extracting the data using Azure resources.
When consolidating different Excel files, or even processing the same file on a regular basis, reporting teams run into a number of problems:
- Because these Excel files are maintained by non-technical people, and relevant protection is rarely applied to them, the structure of the workbook (number of sheets, number of columns and their names) may change.
- Data is not formatted consistently and can change with every file.
- Size of data – especially for data that has been maintained over years, file sizes can run into 50–100 MB or more.
There are several approaches you can take to solve the above problems, and it is up to each organization to pick the one that suits it best:
- Azure Function App – functions make use of code and can be used to carry out many complex activities, including file processing. You can choose from a variety of languages, including C#, Python, Java and so on. Functions run on demand, can be triggered by any service within or outside of Azure, and you pay only when they are invoked. For smaller workloads this is a cheaper option than ADF, but processing huge files that take a long time can be a problem and may end up more expensive than ADF. (A minimal sketch of an Excel-processing function follows this list.)
- Azure Databricks – just like the Function App, this is also code first, and you can choose from Python, Scala, R and SQL. The code runs on a Spark cluster, which can be invoked from a job within Databricks or by an external service such as ADF. Reading an Excel file is much easier in Databricks, and the code-first approach makes it flexible enough to handle files whose structure changes. On top of that, it can also be used for big data processing, streaming data processing and data science, so it is not only capable of processing large files but can perform the full Extract, Transform, Load cycle within a single Databricks notebook. You pay only for the time the cluster is active. (A sketch of reading an Excel file in a notebook also follows this list.)
- Azure Data Factory –
- Copy activity – as the name suggests, it copies data from one place to another. If you have a simple Excel sheet whose structure you know will not change, and you know the sheet names exactly, then by all means go for it: it is cheap and fast. However, once the file becomes more complex, this is no longer a viable solution.
- Data flow activity – visually designed data transformation, with no coding. A data flow consists of sources, sinks and other transformation activities that let you manipulate an Excel file. If you have worked with SSIS, a data flow in ADF is quite like a data flow task in SSIS. The data flow activity is executed as part of a data factory pipeline and runs on Apache Spark clusters, so it can also process very large files. However, it is somewhat limited in functionality, and you cannot achieve everything you can with Databricks or Function Apps.
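To make the Function App option more concrete, here is a minimal sketch of a blob-triggered Python function that reads whatever worksheets an uploaded workbook contains. It assumes the blob trigger binding is defined in function.json and that pandas and openpyxl are listed in requirements.txt; the binding name `excelblob` and the container it watches are placeholders, and writing to the database is left as a comment.

```python
import io
import logging

import azure.functions as func
import pandas as pd  # pandas + openpyxl must be in requirements.txt


def main(excelblob: func.InputStream):
    """Runs whenever a new Excel file lands in the monitored blob container
    (the blobTrigger binding itself is configured in function.json)."""
    logging.info("Processing Excel blob: %s (%d bytes)", excelblob.name, excelblob.length)

    # Read every sheet (sheet_name=None) so the code does not break
    # when the number of sheets or their names change.
    sheets = pd.read_excel(io.BytesIO(excelblob.read()), sheet_name=None)

    for sheet_name, frame in sheets.items():
        # Normalise column names so downstream joins survive cosmetic renames.
        frame.columns = [str(c).strip().lower().replace(" ", "_") for c in frame.columns]
        logging.info("Sheet %s: %d rows, %d columns", sheet_name, len(frame), len(frame.columns))
        # ... write the frame to the relational database here (e.g. via pyodbc / SQLAlchemy)
```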
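And here is a hedged sketch of the equivalent read in a Databricks notebook. It assumes the open-source spark-excel library (Maven coordinate com.crealytics:spark-excel) is installed on the cluster; the storage path, sheet address and target table name are placeholders for illustration only.

```python
# Databricks notebook cell – `spark` is provided by the notebook session.
file_path = "abfss://raw@mystorageaccount.dfs.core.windows.net/uploads/sales.xlsx"  # placeholder path

df = (
    spark.read.format("com.crealytics.spark.excel")
    .option("header", "true")              # first row contains column names
    .option("inferSchema", "true")         # let Spark guess column types
    .option("dataAddress", "'Sheet1'!A1")  # sheet and starting cell to read
    .load(file_path)
)

# Normalise column names so a renamed column does not break downstream joins.
for old_name in df.columns:
    df = df.withColumnRenamed(old_name, old_name.strip().lower().replace(" ", "_"))

# Land the cleaned data in a table that reports can join with database data
# ("staging.excel_sales" is a placeholder target).
df.write.mode("overwrite").saveAsTable("staging.excel_sales")
```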
So what should be your go-to choice? I'd say it depends on the type of files, the processing required and your experience – pick the approach that fits those best.
However, whichever option you choose, I would still use ADF as the mediator that orchestrates the flow: it connects to the different data sources and triggers my processing of choice (Databricks, Functions etc.).
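In practice that orchestration pipeline would usually be started by a schedule or a storage event trigger inside ADF, but as an illustration, the sketch below starts a pipeline run programmatically using the azure-identity and azure-mgmt-datafactory packages. The subscription, resource group, factory, pipeline and parameter names are all placeholders.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder values – substitute your own subscription and ADF resources.
subscription_id = "<subscription-id>"
resource_group = "rg-reporting"
factory_name = "adf-excel-orchestrator"
pipeline_name = "pl_process_excel_uploads"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Kick off the orchestration pipeline, passing the uploaded file name as a parameter.
run = client.pipelines.create_run(
    resource_group,
    factory_name,
    pipeline_name,
    parameters={"fileName": "sales.xlsx"},
)
print(f"Started pipeline run {run.run_id}")
```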