Allow for processed data to be re-processed quickly next time

Created on Friday 5 February 2021, 06:01

  • ID
  • Project
    Metabolism of Cities Data Hub
  • Status
  • Priority
  • Type
    Programming work
  • Assigned to
    No one yet
  • Subscribers
    Jens Peters
    Paul Hoekman

Sometimes the work of preparing data to be processed takes quite some time. Raw data may need to be manipulated quite a bit in order to get it into the right shape and form. After this work is done, the system can finally read and record the data. However, this process will need to be repeated the next time the same dataset is updated by the original author. It would be good to see if there are ways in which we can streamline this process.

Some possibilities include:

  • Looking at common processing patterns (e.g. converting columns to rows) and trying to embed this functionality on the site, instead of having people do it in the spreadsheet.
  • Somehow saving the processing as a formula/macro in the spreadsheet, so it can be re-run easily.
  • Looking at Python scripts / Jupyter notebooks to help with the processing and thus allow for re-processing new data with the click of a button.
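As a rough illustration of the first point, one of the most common processing patterns (converting columns to rows, i.e. reshaping a wide table into a long one) is a one-liner in pandas. This is only a sketch with hypothetical column names and values, not a proposal for the actual site implementation:

```python
import pandas as pd

# Hypothetical raw table: one row per product, one column per year (wide format)
wide = pd.DataFrame({
    "product": ["Steel", "Cement"],
    "2019": [120, 300],
    "2020": [130, 310],
})

# Convert the year columns into rows (wide -> long format)
long = wide.melt(id_vars="product", var_name="year", value_name="amount")
print(long)
```

If this kind of transform were embedded on the site, contributors could upload the raw wide-format sheet and have the reshaping re-applied automatically on each update.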

Let's start by collecting some examples of datasets that are time-consuming to process, so that we can make an informed decision on the best route forward. Anyone who has a good example, please post it below.

Discussion and updates

New task was created

Here is a dataset Jens shared in this thread (which is what prompted this task in the first place).

I'll add some more datasets. The ones I am currently dealing with basically all provide several columns of data over a common timeline (rows). The goal should be to make appending data easy, e.g. by simply pasting the raw data table as downloaded into one Excel tab, while a second tab (the one that is then uploaded) links to it and has the data in the correct format right away. A simple possibility would be to duplicate all the required columns with a suffix per dataset, e.g.:
product (name)[A] | product (code)[A] | Amount[A] | Unit[A] | Segment[A] | product (name)[B] | product (code)[B] | Amount[B] | Unit[B] | and so on (date, period and reference space can be common for all).
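Stacking such a duplicated-column layout back into one long table is also straightforward to automate. A minimal sketch, assuming simplified column names tagged [A] and [B] as in the layout above (the data values are made up):

```python
import pandas as pd

# Hypothetical sheet following the duplicated-column layout:
# a shared date column plus per-dataset columns tagged [A], [B], ...
df = pd.DataFrame({
    "date": ["2020-01", "2020-02"],
    "product[A]": ["Steel", "Steel"],
    "Amount[A]": [10, 12],
    "product[B]": ["Cement", "Cement"],
    "Amount[B]": [5, 6],
})

# Stack each tagged block into one long table, keeping the tag as a 'source' column
blocks = []
for tag in ("A", "B"):
    block = df[["date", f"product[{tag}]", f"Amount[{tag}]"]].copy()
    block.columns = ["date", "product", "amount"]
    block["source"] = tag
    blocks.append(block)
long = pd.concat(blocks, ignore_index=True)
print(long)
```

The same loop would extend to any number of tags, so re-processing an updated sheet is just a matter of re-running the script.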

It would also be great if the system could ignore rows with an empty date. Some data tables have gaps due to their format, and not having to revise these would be nice (though it's also not such a big deal to do it manually). See examples.
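Skipping rows without a date would be a small filter step in the same kind of script. A sketch with a hypothetical table where empty strings and missing values both count as "no date":

```python
import pandas as pd

# Hypothetical table with gaps: some rows have no date and should be skipped
df = pd.DataFrame({
    "date": ["2020-01", None, "2020-02", ""],
    "amount": [10, 0, 12, 0],
})

# Keep only rows where the date is present and non-empty
cleaned = df[df["date"].notna() & (df["date"] != "")]
print(cleaned)
```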

Task was assigned to Paul Hoekman

Thanks for the info, Jens, this is very helpful! It will take some analysis and time to figure out the best way forward, but this is something we'll look at.

If someone has similar (or other) ideas about how to make updating of data more efficient - or sample files to share - please do post!

Paul Hoekman is no longer responsible for this task