[24.11] DataJamDays 2017

oleg · 16 November 2017 23:00

How do we promote relevant scientific information in a way that can be trusted very quickly (i.e. analytically)?

This was one of the central topics we discussed at DataJamDays 2017, held at the EPFL together with datascience.ch (SDSC). A bunch of fascinating and just awesome research datasets were presented in the morning, and there were introductions to the new Renga platform for reproducible data science.

During the morning introductions, I briefly explained the mission of Opendata.ch and Open Knowledge and our work with public and science institutions over the years, drew a line back to the research hackdays with FORS that took place on campus in 2015, and looked forward to Core Data.

During the day, I downloaded and looked into the Renga project, discussed data publishing standards with their team, met some really interesting people, and learned more about the architecture and interfaces of the Zenodo and Figshare open science platforms.

I also attended a workshop run by Jan Krause from the EPFL Library, who introduced Jupyter and Python in a format inspired by Data Carpentry. This engaging session gave me an excuse to put together a Julia notebook to complement the Python notebooks we examined. You can view it here and download the source, or just fork it on GitHub.

Read Exploring the oceans around Antarctica for more impressions from another participant.

A recurrent piece of feedback I have heard in recent times is that the open data movement needs to partner more with science and not reinvent wheels. Judging by today’s event, the science of data is moving along very quickly in academia, and it is in everyone’s interest that bridges are built and maintained. Improving the discoverability and reusability of open science data is definitely something that we can be more involved in.

Many thanks to the EPFL and to @heluc especially for the chance to contribute to today’s jam. See you in January!

gagarine · 19 November 2017 03:16

Thanks for this update

That makes me wonder. I think the OpenData movement should focus a bit more on the complete political process and on social questions. So perhaps the open data movement can become a bridge between society, politics & institutions, data science and social science.

First, there is Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”

But I also think it’s “too easy” to find correlations in data. Of course, there are sophisticated tools today that go way beyond linear regression. But still, you are not really using a “systems thinking” approach. I’m not saying it’s not possible; good data is also necessary for that kind of approach. What I’m saying is that for politics, data is perhaps not enough, because it is more than action/reaction: it’s a social system, and the system and its actors can change, creating an all-new problem. Also, we do not have robust models for the majority of the important questions faced by society and individuals every day.

So, in my opinion, the OpenData movement should start to really look at how “opendata” is supposed to work socially. What is the effect on people, for example?

For example, if you take two groups of people with diverging “interests”, a simple economic optimization that creates losers and winners is perhaps not the best for society, even if the sum is positive.

oleg · 19 November 2017 15:54

Thank you for the supportive comments. I can definitely relate to the sentiments behind systems thinking, good data, and a more critical discussion in society being the driving motives for the open data movement in Switzerland and beyond. Glad to have you on board! So what kind of diverging interests do you think are at stake when it comes to open science?

gagarine · 19 December 2017 23:22

You don’t think science is open by definition? I mean, if it’s not public, it’s not science, as you can’t replicate the experiment.

That said, perhaps it is more about science that uses data. For science based on big data, I can imagine researchers being under huge pressure to publish and find new stuff. So if you have a big collection of data and a supercomputer, you have a huge incentive to search for correlations in those databases. That can give some important results, but it can also be misleading.
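
To make the "misleading" part concrete, here is a minimal sketch in Python, under the assumption that every variable is pure random noise with no real relationship to any other. The function name `strongest_spurious_pair` and its parameters are hypothetical and chosen only for illustration. With enough variables and few samples, an exhaustive search will still surface a pair that looks strongly correlated.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_spurious_pair(n_variables=200, n_samples=20, seed=0):
    """Generate independent noise columns and return the pair of column
    indices with the largest absolute correlation, plus that correlation.

    Assumption: the columns are independent Gaussian noise, so any
    correlation found here is spurious by construction.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0, 1) for _ in range(n_samples)]
        for _ in range(n_variables)
    ]
    # Exhaustively compare every pair of columns, as a data-dredging
    # search over a large database would.
    best_pair = max(
        combinations(range(n_variables), 2),
        key=lambda pair: abs(pearson(columns[pair[0]], columns[pair[1]])),
    )
    return best_pair, pearson(columns[best_pair[0]], columns[best_pair[1]])


if __name__ == "__main__":
    pair, r = strongest_spurious_pair()
    print(f"Strongest 'finding' among random columns: {pair}, r = {r:.2f}")
```

With 200 columns there are 19,900 pairs to compare, so a high correlation somewhere is almost guaranteed even though nothing real is there to find. Correcting for the number of comparisons, or checking the finding on fresh data, is the usual guard against this.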

More importantly, using advanced pattern recognition (AI/deep learning) you can sometimes get better results, but it can become very hard to explain the “why”. For example, imagine you use a lot of parameters to try to predict when there will be a war in any given country. Imagine you end up with a good model with pretty good accuracy. Of course, as there are a lot of parameters that create millions of interconnections, you can’t explain “why”.

Then what do you do the next time a country is flagged as “certainly going to have a war”?