Open data module at BFH 2018

The intent of my lecture in the Certificate of Advanced Studies in Data Analysis at the Bern University of Applied Sciences is to present a practitioner's perspective, along with some introductory background on open data, the open data movement, and several real-world projects - with details of the data involved, the legal conditions and the technical challenges. This post is a refresh of part II of last year's course notes; the introductory lecture has changed to a lesser degree.

If you’re interested in taking the course, you can sign up for a future semester at CAS Datenanalyse | BFH

In the first week, we started the module by considering how attention to certain types of questions leads to virtuous cycles of data, information and knowledge, and how the opening of data activates this cycle. We covered definitions of open data, as well as the various types of licenses, guidelines and publication standards involved. Then our focus turned to Switzerland: the origins of the open data movement here, and the opportunities and challenges that exist in regard to public and government data.

After discussing the role of the community in validating use cases, we learned in part II how to use open data ourselves in a hands-on way, looking at what happens behind the scenes in open data portals and trying out some open source tools on datasets researched together. But first, we began the class with a minute of silence for Alain Nadeau, a Swiss pioneer of open data whose obituary had been posted just a few hours earlier.

The screenshot above shows an example script from class, suggested as a homework assignment last week: using the ckanr library to search for and directly download open data through the opendata.swiss portal's CKAN API:

install.packages("ckanr")
library('ckanr')

# Initialise the CKAN library with a remote portal
ckanr_setup(url = "https://opendata.swiss")

# Run a search to get some data packages
x <- package_search(q = 'name:arbeitslosenquote', rows = 1)

# Note that on the Swiss server the titles are multilingual
x$results[[1]]$title$de

# Get the URL of the first resource in the first package
tsv_url <- x$results[[1]]$resources[[1]]$download_url

# Download the remote (Tab Separated Values) data file
# ..and parse it in one step
raw_data <- read.csv(tsv_url, header=T, sep="\t")

# Draw a simple plot of the first and second column
plot(raw_data[,2], raw_data[,1], type="b")

As a “bonus”, students were also encouraged to try the “next generation” library datapackage-r described in the Frictionless Data Field Guide, with one of the data packages on datahub.io or openfood.schoolofdata.ch. Students reported difficulty getting the (still new and not thoroughly tested) library to work. I explained that the example code in datahub.io datasets uses the jsonlite library directly to work with the Data Package specification, and will switch to the official Data Package library once it is mature enough. Here is a code snippet from datahub.io:

library("jsonlite")

json_file <- "http://datahub.io/core/cofog/datapackage.json"
json_data <- fromJSON(paste(readLines(json_file), collapse=""))

path_to_file = json_data$resources$path[1][1]
data <- read.csv(url(path_to_file))
print(data)
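
For comparison, here is a minimal sketch of what the same task could look like with datapackage-r. This is an untested illustration based on the library's documented interface at the time, so (as the students found) the young library may not behave exactly as shown:

install.packages("datapackage.r")
library(datapackage.r)

# NOTE: sketch only - Package.load() and resource$read() follow the
# library's documentation and may have changed in newer releases

# Load the Data Package descriptor from datahub.io
dp <- Package.load("http://datahub.io/core/cofog/datapackage.json")

# Read the rows of the first resource declared in the package
resource <- dp$resources[[1]]
rows <- resource$read()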

We then went through the CKAN code in some more detail and made some additional searches, which led us to the issue of indirect linking. A dataset that interested the students (“Schutzwald”) was labeled as “CSV format”, but the download link takes one to a separate geo-portal, where it is possible (though not very intuitive) to get access to CSV-formatted data. As we saw, this can be quite misleading for developers, who expect direct links to the data.
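
One way to catch such indirect links early is to compare each resource's declared format with its actual URL before trying to parse anything. A quick sketch along those lines, continuing from the ckanr setup above (the search term is only for illustration):

library(ckanr)
ckanr_setup(url = "https://opendata.swiss")

# Search for the dataset discussed in class
res <- package_search(q = 'schutzwald', rows = 1)

# A "CSV" format label pointing at an HTML portal page
# rather than a data file is a warning sign
for (r in res$results[[1]]$resources) {
  cat(r$format, "->", r$download_url, "\n")
}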

To round off our technical discussion, prompted by student interest, I also ran a demonstration of the Wikidata Query Service (covered in some detail here), working through some SPARQL requests and explaining the interface - noting its facility to generate code snippets such as this one, which makes use of Linked Open Data in R through the SPARQL library:

library(SPARQL)  # SPARQL querying package
library(ggplot2) # loaded for optional plotting of the results

endpoint <- "https://query.wikidata.org/sparql"

# "Cats, with pictures" - a standard Wikidata example query
# (P31 = instance of, Q146 = house cat, P18 = image)
query <- '#Cats, with pictures
#added before 2016-10
#defaultView:ImageGrid
SELECT ?item ?itemLabel ?pic WHERE {
  ?item wdt:P31 wd:Q146.
  ?item wdt:P18 ?pic.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}'

qd <- SPARQL(endpoint, query)
df <- qd$results
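
The SPARQL call returns a regular data frame, so the usual R tools apply; for example, a quick sanity check of what came back:

# Each row holds an item URI, a label and an image URL
nrow(df)
head(df$itemLabel)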

The differences between the CKAN-based opendata.swiss portal and other CKAN portals like the old DataHub.io, as well as the competing platforms OpenDataSoft and Socrata and the emerging Frictionless Data approach, were outlined to the participants. In addition to covering some of the main features of open data portals, we discussed how to use R effectively to work with datasets from different sources.
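
As one illustration of working across portal types, Socrata portals can be queried from R with the RSocrata package. A minimal sketch, using the demo dataset URL from the RSocrata documentation:

install.packages("RSocrata")
library(RSocrata)

# read.socrata() accepts a dataset or SoDA API URL
# and returns a data frame
earthquakes <- read.socrata("https://soda.demo.socrata.com/resource/4334-bgaj.csv")
head(earthquakes)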

We then had a discussion on what makes an exemplary open data project. During the introduction, I mentioned the School of Data workshop that had taken place the previous day, and explained how our regular hackathon events work to crowdsource ideas and prototypes from the community. We looked in particular at last year's Open Data Day hackathon organised by the Zurich R User Group, where teams like Predict Delays used open data sources to create analytical models and publish them as easy-to-use Shiny apps, while sharing the open source code on GitHub.

These hallmarks of open data development led us to launch into a mini-hackathon during the second half of the class, inspired by the make.opendata.ch events. We divided into teams of 3-4 people, took up roles (Expert - Designer - Developer), brainstormed and researched open data sources, and built rapid prototypes against a ticking countdown clock. About 45 minutes were spent on the whole exercise, which was followed by presentations and discussion of everything that was found - and, more interestingly, a hard look at the barriers that prevented teams from getting closer to the challenge they picked or using the data they wanted.

Notes on each of the 4 topics we addressed, along with the ideas, references and datasets researched and presented by the students, are shared in an internal Etherpad. I was impressed by the students' tenacity during this exercise and their willingness to work together and learn from each other, and the observations from the group presentations made for excellent closing material to this semester's Open Data class.

Many thanks once again to all of the students who enthusiastically took part in the module, to the BFH staff and my fellow teachers of the CAS.

My updated slides and notes in Markdown format from the latest course are online. A brief summary of the new content, continuing from the above:

We spent more time this module on Linked Open Data, covering the basic principles of the Semantic Web and RDF. A good introduction in German can be found in Linked (Open) Data - Von der Theorie zur Praxis (HTW Chur), and in English from Cambridge Semantics. We watched this introduction video from the Linked Data Service (LINDAS) of the Swiss Federal Archives:

I also shared a recent community project, an Advent Calendar (opü.ch) of Linked Open Data queries, which highlights the diversity and practical range of interesting things to discover on the Semantic Web. Among the excellent examples is an introduction from the Statistics department of the City of Zürich, with example code and R scripts (GitHub).
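
To make the RDF principles concrete: every statement on the Semantic Web is a subject-predicate-object triple, and a SPARQL query is simply a triple pattern with variables in some of the positions. A minimal sketch, reusing the SPARQL package from earlier (wd:Q70 is Bern, wdt:P1082 the population property):

library(SPARQL)

endpoint <- "https://query.wikidata.org/sparql"

# One triple pattern: subject = Bern (wd:Q70),
# predicate = population (wdt:P1082), object = a variable
query <- 'SELECT ?population WHERE { wd:Q70 wdt:P1082 ?population. }'

qd <- SPARQL(endpoint, query)
qd$results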

We had a discussion about projects that improve educational material and create shared resources for Data Literacy - helping the work of Data Analysts to be better understood and reused - such as:

  • The OGD Handbook at handbook.opendata.swiss, which provides guidance for government and people who work with the public sector.
  • OpenSchoolMaps (community-based) and sCHoolmaps.ch (government-run) are two of the many useful educational resources for working with (open) (geo)data.
  • SchoolofData.ch, part of a civil society initiative involved in research programs with a grassroots international organization and Opendata.ch

– From R survey responses, School of Data on GitHub

Since part of our discussion is about the open data movement in Switzerland, we summarized the opportunities and challenges that exist here in regard to public data. For a good summary of progress in the area of Open Government Data, see the Yearbook of Swiss Administrative Sciences 9(1), pp. 66–79.

After discussing the role of the community in validating use cases, our next goal was to explore open data ourselves in a hands-on way, looking at what happens behind the scenes and trying out some open source tools on datasets we researched together in class. We looked at some community projects, in particular Open Budgets (Open Data Camp Bern 2013) and Predict Delays (Open Data Day Zürich 2017). We also watched a short clip of last year's Open Food Data Hackday.

With the hallmarks of open development explained, we launched into a mini-hackathon inspired by the make.opendata.ch events. We divided into 4 teams of 4-5 people each, took up roles (Expert - Designer - Developer), brainstormed and researched open data sources, and tried to find or even (very) rapidly build prototypes against a ticking countdown clock.

Two blocks of 45 minutes were spent on the exercise, followed by 5-minute pitches from each team, along with discussion and feedback focused on unravelling the barriers that prevented teams from getting closer to the challenge they picked or using the data they wanted. The challenges chosen were:

  • How many snow days will there be in Bern in 2025?
  • Does religion have an influence on life expectancy?
  • In which hospital should I seek treatment?
  • How ecological are electric cars?

The group responded very well to the hackathon format. One team even outlined the following “building blocks of success” in their notes, which I will paraphrase here:

  • Common understanding of the problem
  • Formulate hypothesis
  • Organization in the team
  • Inform yourself about the topic
  • Data retrieval
  • Explore, clean and enrich the data
  • Analyze, explain / present

One of the questions asked was about the methodology of hackathons, to which I replied with information about the new BFH research project that Opendata.ch is cooperating with: #hack4socialgood. Several people stayed back to chat some more about hackathons and civic tech, machine learning and career paths.

Many thanks to all of the students who enthusiastically took part in the module, to the BFH staff and my fellow teachers of the CAS!

From the next semester there will be a slight change to the program, and I really encourage everyone to check out the list of upcoming open data events such as the annual Opendata.ch conference, this time taking place in Bern.

https://twitter.com/OpendataCH/status/1087384206240555009