-
Loops in Databricks or Parallel Processing
Loops and conditions are unavoidable in both application development and data processing. For a data engineer, however, they become a nightmare if not handled correctly. I will not dive deep into how loops are handled in Databricks. In short, the problem is that the same block of code is processed one… Continue reading
-
Databricks – Processing Geographic Data for Australia
A little bit about shapefiles: one way of storing geographic data is the shapefile format. The shapefile format was created by ESRI and holds vector data. The ESRI technical documentation describes a shapefile as something that stores nontopological geometry and attribute information for the spatial features in a data set. Geometry for a feature is… Continue reading
-
Azure – Databricks Integration with GIT Devops
The write-up below demonstrates how you can use Git source control for your Databricks code. Azure Databricks has its own facility where you can maintain your code, no doubt. However, if you are using multiple technologies (especially in a BI project, where you are bound to use ADF, MS SQL, Azure SQL Server and so on), it… Continue reading
-
Access Blob Storage from Azure Databricks using Azure Key Vault
To access Blob Storage from Databricks, you need to add a secret scope. A secret scope is a collection of secrets identified by a name. There are two types of secret scopes: Below is a step-by-step guide on how to access Blob Storage from Databricks: Assumptions: Azure Databricks and Azure Storage… Continue reading
-
Azure – Process and extract excel files
Excel is still widely used across all industries to maintain data. This data often needs to be joined with application-level data held in relational databases in order to create various reports. These Excel files may arrive through various mediums: email, SharePoint… Continue reading
-
How to use the transformations in a Dataflow to optimize the performance
In SSIS, we have many different kinds of transformations that we can use to cleanse and restructure our data before sending it to the destination. However, the key thing to note is that, behind the designer, not all of these transformations behave the same way. Each has its own way of… Continue reading
-
SSIS Variables Vs Parameters
Many people are confused about how parameters and variables work. To demonstrate the differences clearly, you need to deploy a package to SQL Server. Parameters: There are two types of parameters: package parameters and project parameters. The only thing that differentiates the two types is their scope. Project parameters… Continue reading
-
Accessing FTPS using SSIS
Although native SSIS includes an FTP task, it lacks both SFTP and FTPS tasks. There are three ways to access FTPS: 1. Using third-party components, e.g. ZappySys, COZYROC, etc. 2. Purely code-based 3. Using command-line tools. Using a third-party component is pretty straightforward if you have an understanding… Continue reading
-
Surrogate and Business Keys
Looking at my previous write-up on the purpose of a BI solution, there are four main requirements that can be expected of a dimension table: They should obviously be linked to the facts or the business process. Since we are getting data from so many different back-end source systems, we need to keep track of… Continue reading
-
Purpose of having a separate BI solution.
Here are several questions and answers that I have not written down before, partly because I was not interested in writing about what I know, what I feel, and my viewpoints. What is the purpose of having BI? To analyse data, aggregate, slice and dice, and identify trends and correspondences that enable people to make decisions, from simple marketing… Continue reading