Open Data Node - first dataset

A short introduction to how you can start using ODN

Peter Hanecak

Assuming you have performed the Open Data Node (ODN) installation, you've already begun the long journey of Open Data publication (we thus also assume you've studied our Methodology, performed the preparatory steps and have a good understanding of what lies ahead, process-wise). This post will briefly introduce one possible way to start using ODN to publish data and thus achieve your goals.

Starting conditions

To start, we assume you have:

  1. a fresh ODN instance: If you have not done so already, please follow

    https://github.com/OpenDataNode/open-data-node/blob/master/INSTALL.md to install your instance.
    On our end, we are using the instance of the OpenData.sk community available at http://odn.opendata.sk/ .
    If you wish to just try ODN without installing it, you can use the demo instance at http://demo.comsode.eu/ .

  2. an SQL database with some data: On our end, to make things easier, we use a dummy PostgreSQL database with just one table and very little data. You can (carefully) use your existing database(s) or (to be safe) create the same testing database as we did, using the following database dump: odn-dummy_db.sql (see also the sketch below).
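
If you want to reproduce a similar starting point without the dump, here is a minimal sketch of such a dummy table, using Python and psycopg2. Note that the database, user, table and column names below are illustrative assumptions of ours, not the contents of odn-dummy_db.sql:

    # Minimal sketch: create and fill a dummy PostgreSQL table.
    # All names and credentials below are illustrative placeholders.
    import psycopg2

    conn = psycopg2.connect(host="localhost", dbname="odn_dummy",
                            user="odn", password="changeme")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS dummy_data (
            id    SERIAL PRIMARY KEY,
            name  TEXT NOT NULL,
            value INTEGER
        )
    """)
    cur.execute("INSERT INTO dummy_data (name, value) VALUES (%s, %s)",
                ("example row", 42))
    conn.commit()
    cur.close()
    conn.close()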

In the following text, we will use http://<ODN instance>/ to refer to the URL used to access your instance. Replace <ODN instance> with the hostname (FQDN) assigned to your instance (we're using odn.opendata.sk, as mentioned above). Typically:

  • there is a so-called public catalogue, available at http://<ODN instance>/, which can be seen by all and where the results of your work will be found

    (Screenshot: ODN Public Catalog)

  • there is a so-called internal catalogue, available at https://<ODN instance>/internalcatalog/, which can be seen only by you (or people you've allowed) and where you'll do your Open Data publication work

    (Screenshot: ODN Internal Catalog)

Creating a first dataset and first publication pipeline

The first step is to log into the internal catalogue at https://<ODN instance>/internalcatalog/ and create the first dataset entry (click My Datasets and then Add dataset). What you fill in will later be seen by the general public, but for now it is private by default. You'll make it public at the end, once you're satisfied that everything is correct. For now, make sure to properly describe the data you're going to publish (at least the title and description). Do not worry: it can be reviewed and changed later if needed. When finished, click Next: Add data.
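
As a side note: ODN's catalogues are based on CKAN, so this step can in principle also be scripted via the standard CKAN action API instead of the web UI. Below is a minimal, hedged sketch in Python; the API path, dataset fields and API key are assumptions of ours, and your instance may require extra fields (e.g. an owning organization for private datasets):

    # Sketch: create a (private) dataset entry via CKAN's package_create
    # action. URL, API key and field values are placeholders.
    import json
    import urllib.request

    request = urllib.request.Request(
        "https://<ODN instance>/internalcatalog/api/3/action/package_create",
        data=json.dumps({
            "name": "my-first-dataset",   # URL slug (placeholder)
            "title": "My first dataset",
            "notes": "Description of the data being published.",
            "private": True,              # stays private until reviewed
        }).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "<your CKAN API key>"})
    print(json.load(urllib.request.urlopen(request)))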

The second step is to add data. But since you are going to automate the publication (one of the basic motivations for using ODN), you're not going to upload files or add links to data by hand; you're going to create a pipeline instead. Thus, click Finish for now and you'll see your first dataset (so far without data).

Instead of adding data directly, you will create a pipeline (an automated process which will extract data from the internal system and add the data to the dataset for you) as follows: click Manage, then Pipeline, Add pipeline, Create new manually, fill in the pipeline name and description, and click Finish.

After that, you will get into ODN/UnifiedViews (the ETL tool), where you can create the pipeline (drag and drop DPUs from the left menu onto the canvas, configure and connect them, and save the result).

For the pipeline, you will need the following DPUs (data processing units):

  • uv-e-relationalFromSql, to extract data from the SQL database

  • uv-l-relationalToCkan, which will take the data extracted by the previous DPU and load (publish) it into the CKAN data catalogue (the internal one)

  • uv-e-DistributionMetadata, through which you can supply information about the actual piece of data you're publishing

For brevity, we've included only a screenshot of the resulting pipeline. For more details, please follow the screencast at the end of this section. You can also take advantage of the ability to import and export pipelines and simply import the one we've exported: odn-test_pipeline.zip

When the pipeline is finished, you can run it. The pipeline will extract the data from the SQL database and load it into the internal catalogue as a so-called resource under the dataset. You can see it when looking again at your dataset in the internal catalogue.

Remember, for now the dataset is still private, thus only you can see it, in the internal catalogue. After you review the dataset and all is correct, you are almost finished. The third and next-to-last step is to make the dataset public: look up the dataset, click Manage, change the Visibility from private to public and finally click Update Dataset.
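
If you prefer to script this visibility change too, the same effect can in principle be achieved via CKAN's package_patch action; again a hedged sketch with placeholder URL, dataset name and API key:

    # Sketch: flip the dataset from private to public via the CKAN API.
    import json
    import urllib.request

    request = urllib.request.Request(
        "https://<ODN instance>/internalcatalog/api/3/action/package_patch",
        data=json.dumps({"id": "my-first-dataset",   # placeholder name
                         "private": False}).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": "<your CKAN API key>"})
    print(json.load(urllib.request.urlopen(request))["success"])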

You can now verify that your first dataset can also be seen in the public catalogue.
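
This check can also be done programmatically: once public, the dataset should be readable anonymously via the public catalogue's CKAN API. A small sketch, assuming the standard CKAN API path and our placeholder dataset name:

    # Sketch: confirm the dataset (and its resources) is publicly visible.
    import json
    import urllib.request

    url = ("http://<ODN instance>/api/3/action/package_show"
           "?id=my-first-dataset")   # placeholder dataset name
    with urllib.request.urlopen(url) as response:
        result = json.load(response)

    print("visible:", result["success"])
    print("resources:", [r["url"] for r in result["result"]["resources"]])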

The final step is to truly take advantage of automated Open Data publication: you now have an automated pipeline ready, so the last thing needed is to tell ODN to run your pipeline automatically and repeatedly, for example after each midnight, so that updates in the internal data are also replicated to your Open Data dataset as seen by data users (citizens, NGOs, companies, etc.) in the public catalogue. To do that, access ODN/UnifiedViews (via Tools > UnifiedViews), go to Scheduler, click Add new scheduling rule, select your pipeline, select the time when it should run and also select Interval: every day.

A full screencast of the above-mentioned steps can be seen here:

Results

With the dataset and pipeline created and set up, ODN will execute the pipeline after each midnight, keeping your dataset up to date without the need for repeated manual work. The data can be accessed and used by the general public via the public catalogue.