Efficient dataset replication with ODN

Imagine you have a large dataset that changes frequently (records get added, changed, and deleted daily) but carries no information about what changed and when. When publishing such a dataset, you face two typical problems:

  1. it is hard for users to see what changed (imagine you have one million entries, yet only a few of them get updated each day)
  2. it is hard for users to get updates to the dataset: they have to resort to repeatedly downloading it whole, which stresses their resources but, most importantly, also yours (imagine hundreds or thousands of users each downloading a few hundred megabytes from you every day)

Open Data Node (ODN) can help you with that: you can use the uv-l-relationalDiffToCkan DPU to detect changes in your dataset and publish an “enriched” dataset in which the changes are marked.

How it works

Let’s say your internal dataset (table) looks like this (a very simplified example):

[Screenshot: table 1, a simplified example table]

(For the real world, imagine you also have other columns and many, many more rows in your dataset.)

Let’s say it is published as a CSV file, the easiest publication method available. People can download it and either look for changes manually or use tools to find the differences (programmers may use the `diff` tool). In such a simple example, spotting new lines, missing lines, or lines where the content of the “data” column changed is trivial. But again, with more columns and many more rows, this task gets difficult.
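To illustrate the brute-force approach users are left with, here is a minimal Python sketch that compares two full snapshots of the published CSV (the file names are hypothetical):

```python
import difflib

# Two full downloads of the published CSV (hypothetical file names).
with open("dataset_yesterday.csv") as old, open("dataset_today.csv") as new:
    old_lines = old.readlines()
    new_lines = new.readlines()

# unified_diff prefixes added lines with '+' and removed lines with '-';
# a modified row shows up as a '-' line followed by a '+' line.
for line in difflib.unified_diff(old_lines, new_lines,
                                 fromfile="yesterday", tofile="today"):
    print(line, end="")
```

Note that even this requires each user to keep (and repeatedly download) full copies of the dataset, which is exactly the waste described in problem number 2.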

So, to help users spot the changes, you can construct a pipeline in ODN/UnifiedViews which will:

  1. extract your data from, say, an internal SQL database using the uv-e-relationalFromSql DPU
  2. detect changes, enrich the data, and publish the result via ODN

[Screenshot: the relational pipeline in ODN/UnifiedViews]

In this simplified example, the uv-l-relationalDiffToCkan DPU will produce something like this:

[Screenshot: table 2, the same table enriched with ‘modified_timestamp’ and ‘deleted_timestamp’ columns]

  • it will add ‘modified_timestamp’ and ‘deleted_timestamp’ columns
  • it will set ‘modified_timestamp’ for each new or modified row to the date and time the change was detected
  • it will set both ‘modified_timestamp’ and ‘deleted_timestamp’ for each deleted row to the date and time the change was detected

Changes are detected while the pipeline is running; thus, if your dataset gets, say, a few updates per day, you can schedule the pipeline to run each night.
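To make these rules concrete, here is a conceptual Python sketch of the diff-and-enrich step. The actual DPU is a UnifiedViews component, so the function names and row representation below are purely illustrative, not the DPU’s real code:

```python
from datetime import datetime, timezone

TS_COLS = ("modified_timestamp", "deleted_timestamp")

def data_only(row):
    # Strip the bookkeeping columns before comparing row contents.
    return {k: v for k, v in row.items() if k not in TS_COLS}

def enrich_with_diff(previous, current):
    """`previous` is the last published (already enriched) dataset,
    `current` is the fresh extract; both map primary key -> row dict."""
    now = datetime.now(timezone.utc).isoformat()
    enriched = {}
    for key, row in current.items():
        row = dict(row)
        if key not in previous or data_only(row) != data_only(previous[key]):
            row["modified_timestamp"] = now      # new or modified row
        else:
            # Unchanged row: carry over the previous timestamp.
            row["modified_timestamp"] = previous[key].get("modified_timestamp")
        row["deleted_timestamp"] = None
        enriched[key] = row
    for key, row in previous.items():
        if key not in current:
            row = dict(row)
            if not row.get("deleted_timestamp"):
                row["modified_timestamp"] = now  # deletion is a change too
                row["deleted_timestamp"] = now
            enriched[key] = row                  # keep deleted rows published
    return enriched
```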

Users can now detect modified rows by looking at ‘modified_timestamp’: knowing when they last looked at the data, they can filter only the rows that changed since then. Keeping deleted rows is important too: if they were simply removed from the published dataset, how would users learn that something is no longer used and was removed? Deletion is a change, so it needs to be tracked as a change. Users will thus recognize such rows first by noticing they changed (thanks to the modification timestamp) and then determine which of those were deleted by looking at ‘deleted_timestamp’.
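A rough sketch of that client-side filtering, assuming the enriched CSV from the example above and ISO 8601 timestamps (so plain string comparison matches chronological order); the file name is hypothetical:

```python
import csv

def changed_since(path, last_seen):
    """Split rows changed since `last_seen` (an ISO 8601 string) into
    updates and deletions, reading the enriched CSV published by ODN."""
    updated, deleted = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            if row["modified_timestamp"] and row["modified_timestamp"] > last_seen:
                # A set 'deleted_timestamp' marks the row as removed.
                (deleted if row["deleted_timestamp"] else updated).append(row)
    return updated, deleted

updates, deletions = changed_since("dataset_today.csv", "2015-06-01T00:00:00")
```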

So far, we’ve covered problem number 1 (how to spot changes). Now, how does that help with problem number 2 (repeated downloads of whole big datasets)?

As the uv-l-relationalDiffToCkan DPU loads data into the CKAN DataStore embedded in ODN, users (programmers) may use its API to get the data:

  1. on the first attempt, simply downloading everything
  2. on subsequent attempts, asking only for the most recently updated items and downloading only those

Thus, when a dataset is huge (say, a few hundred megabytes) but the changes are small (say, only a few items per day), users will in subsequent downloads transfer only a small fraction of the whole dataset, saving themselves valuable time and saving you (the publisher) server and connectivity resources.
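For illustration, here is how such a client might look using CKAN’s standard DataStore API (the datastore_search and datastore_search_sql actions); the instance URL and resource id below are placeholders, and details may vary between CKAN versions:

```python
import requests

CKAN_URL = "https://odn.example.org"   # hypothetical ODN/CKAN instance
RESOURCE_ID = "insert-resource-id"     # placeholder CKAN resource id

def fetch_all():
    """First sync: page through the whole resource with datastore_search."""
    records, offset = [], 0
    while True:
        r = requests.get(f"{CKAN_URL}/api/3/action/datastore_search",
                         params={"resource_id": RESOURCE_ID,
                                 "limit": 1000, "offset": offset})
        r.raise_for_status()
        batch = r.json()["result"]["records"]
        if not batch:
            return records
        records.extend(batch)
        offset += len(batch)

def fetch_changed_since(last_sync):
    """Subsequent syncs: fetch only rows stamped after `last_sync`
    (an ISO 8601 string) via the datastore_search_sql action."""
    sql = (f'SELECT * FROM "{RESOURCE_ID}" '
           f"WHERE modified_timestamp > '{last_sync}'")
    r = requests.get(f"{CKAN_URL}/api/3/action/datastore_search_sql",
                     params={"sql": sql})
    r.raise_for_status()
    return r.json()["result"]["records"]
```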

To demonstrate this better, we’ve also prepared a proof-of-concept harvesting application (implemented in Python, under the BSD license):

https://github.com/OpenDataNode/poc-harvester-app

which can be freely used for study, for experiments, or as a base for dataset replication in data users’ applications.