Keep only the most recent row of data in data factory

I am using Data factory to create our staging area, the problem is whenever source data changes, we add a new row to staging tables.

For instance, assume we have the following data:

  • Unpivot with column name
  • make EF map byte array to binary instead of varbinary
  • .Net vs SSIS: What should SSIS be used for?
  • Service Broker messages start to get hung up after about a day
  • Why should I use int instead of a byte or short in C#
  • SSIS Excel Connection Manager failed to Connect to the Source
  • ID          Fields             created              edited
    100        ----------        '2017-07-01'         '2017-07-05' 

    this will be stored in our staging tables like this:

      ID          Fields             created              edited
      100        ----------        '2017-07-01'            null 
      100        ----------        '2017-07-01'         '2017-07-05' 

    Selecting the most recent row is expensive and We don’t want that. How do you think we can avoid storing duplicate IDs in staging?

    I assume on creating the pipelines, there should be a way to update the data if the ID already exists in staging.
    the query format in data factory is like this:

    $$Text.Format('select * from <<table>> where <<column>> >= \'{0:yyyy-MM-dd HH:mm}\' AND <<column>> < \'{1:yyyy-MM-dd HH:mm}\'', WindowStart, WindowEnd)

  • can fuzzy lookup and fuzzy grouping operations be performed in azure data factory
  • 2 Solutions collect form web for “Keep only the most recent row of data in data factory”

    Here i think most recent record is considered by edited column ,Try this below code

    ;WITH CTE (ID,Fields,created,edited)
    SELECT 100,'------------' ,'2017-07-01',   null     UNION ALL
    SELECT 100,'------------ ','2017-07-01','2017-07-05' 
    SELECT ID,Fields,created,edited FROM
    SELECT *, ROW_NUMBER()OVER(Partition by ID ORDER BY edited DESC ) RecentRecord FROM CTE
    WHERE DT.RecentRecord=1


    ID   Fields         created     edited
    100 ------------    2017-07-01  2017-07-05

    I found a solution to my problem. It goes like this:

    I added a temp table into our data warehouse, before inserting data into Staging, first data comes into these temp tables, I remove those which already exist in staging tables then do a fresh insert. In this way, I will always keep the most recent data in the staging area and I won’t have to use row_number() functions when joining staging tables to create DW.

    I am not sure if it’s the best approach, but it’s working for me.

    MS SQL Server is a Microsoft SQL Database product, include sql server standard, sql server management studio, sql server express and so on.