[27.1] Shipshape open data @ AMLD 2019

oleg · 14 December 2018 14:39

At the Applied Machine Learning Days at EPFL towards the end of January, I will be running a morning workshop for anyone interested in using and publishing open data in ML projects. The description is below; feedback and suggestions are welcome!

The Frictionless Data program at Open Knowledge is the leading community-based effort to update and support open data publication processes worldwide. Building on the experience of developing the technology for thousands of data integration projects and portals like opendata.swiss, we are working on an extensible, cohesively formulated set of standards and a collection of multiplatform libraries and tools, available in many programming languages, to make working with diverse data sources easier, smoother, and more reliable than ever.

This workshop will start with an introduction to the philosophy, concepts, and roadmap of the initiative, contrasted with parallel efforts in data containerization. We will dive in to explore new data sources that support the Frictionless Data specifications, and help you to start a data exploration and machine learning project. Our focus will be on the principles of data exchange and comparability, so more experienced participants can bring their own tools to check compatibility. We will also show easy ways for beginners to start exploring and extending open data in Julia or Python, and share learning waypoints.

In the second half of the workshop, we will look at the question of reproducibility in data science, discuss the challenges involved, and learn how to (re)publish both our code and data in the form of efficient distributed workflows - both to improve accessibility for other users, and to help ensure authenticity of the result. Several case studies, including work-in-progress, will be shared in the group, and we will facilitate a discussion about the opportunities of "industrial-strength" open data.

The Swiss chapter of Open Knowledge, Opendata.ch, ran a hackathon at last year’s AMLD, chronicled in this blog. You can find references to prior workshops on these topics at forum.schoolofdata.ch.

Outcome

Learn from a practitioner about the latest trends at the intersection of crowdsourcing and data science. Gain experience with useful tools and methods. Take steps towards becoming a more active member of the open data community.

Prerequisites

laptop with an up-to-date web browser
optionally a Python, Julia, R, Node.js, Clojure or Go development environment

You can find details of all the workshops online and pick up a ticket here.

oleg · 27 January 2019 13:31

Slides from my workshop this morning are now available.

Thank you for the opportunity to bring these topics to #AMLD2019 and for the very animated discussion and inputs. I’m taking part in the rest of the conference and hoping to have lots more exchange about the motives, standards, challenges and collaborations driving us forward.

oleg · 28 January 2019 11:24

Looking for starting points and examples for boosting your projects with Data Packages? To see the Frictionless Data libraries in action on Machine Learning projects, check out:

In particular the LSTM example from Chapter 16 uses datapackage-py to fetch a dataset from DataHub. You can even use the Python library in your console, though the better supported tool is DataHub’s data-cli.

The starting point for everything else is the official documentation and Field Guide. You’ll find sample code in all the other libraries too, e.g. R, Clojure, Java, Go, Python, JavaScript, as well as my example for Julia:

github.com/frictionlessdata/DataPackage.jl/blob/master/examples/datahub.jl

include("../src/DataPackage.jl")
using DataPackage
import DataPackage: Package, read

data_url = "https://datahub.io/core/pharmaceutical-drug-spending/datapackage.json"

# to load Data Package into storage
package = Package(data_url)

# to load only tabular data
resources = package.resources
for resource in resources
    if resource.profile == "tabular-data-resource"
        data = read(resource)
        println(data)
    end
end
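
For comparison, here is a rough Python sketch of the same filtering step using only the standard library. It is not the datapackage-py API: the descriptor is a made-up inline example rather than the DataHub package, and the helper names tabular_resources and read_rows are hypothetical. It assumes the rows are embedded in the descriptor as a JSON array of objects under "data", so nothing is fetched over the network.

```python
import json

# Hypothetical descriptor, invented for illustration only.
# A real one would be loaded from a datapackage.json file or URL.
DESCRIPTOR_JSON = """
{
  "name": "example-package",
  "resources": [
    {
      "name": "spending",
      "profile": "tabular-data-resource",
      "data": [
        {"country": "AAA", "year": 2015, "value": 100.5},
        {"country": "BBB", "year": 2015, "value": 200.0}
      ]
    },
    {
      "name": "notes",
      "profile": "data-resource",
      "data": {"note": "not tabular"}
    }
  ]
}
"""


def tabular_resources(descriptor):
    """Return the resources whose profile marks them as tabular."""
    return [
        resource
        for resource in descriptor.get("resources", [])
        if resource.get("profile") == "tabular-data-resource"
    ]


def read_rows(resource):
    """Return inline rows as a list of dicts.

    Only inline JSON rows are handled; resources that point to a file
    or URL via "path" are out of scope for this sketch.
    """
    data = resource.get("data")
    if not isinstance(data, list):
        raise ValueError("resource %r has no inline rows" % resource.get("name"))
    return data


descriptor = json.loads(DESCRIPTOR_JSON)
for resource in tabular_resources(descriptor):
    print(resource["name"], read_rows(resource))
```

For real packages on DataHub, the Frictionless Data libraries take care of fetching and parsing the resources, so this is only meant to show what the profile check in the loop is doing.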

oleg · 28 January 2019 16:01

Here is also an example Jupyter notebook prepared for the workshop, which uses the Python Keras library on a classic Machine Learning dataset, Abalone, based on research from the 1990s and now hosted in the Machine Learning collection on DataHub. I’ve copied an ML training run, with very slight modifications, from an excellent blog post by Eric Strong, whose sources are also on GitHub. We added some in-line visuals made with Vega Lite.

Since we ran Keras on top of TensorFlow, we could take advantage of TensorBoard, a visualizer that you can start by running tensorboard --logdir=logs in the same folder as the notebook, then open the displayed link in your web browser to see graphs and representations of the model like these:

If you - like many people I have talked to at AMLD - are interested in reproducible research, you should also check out OpenML where ML studies are conveniently summarized along with comparative reports like this one for a given dataset like abalone. The OpenML project supports Frictionless Data, and can help you integrate and scale your research. @heidi has been involved both with our School of Data initiative and this project.

oleg · 7 February 2019 21:50