Fix shapefile processing holdup

Created on Thursday 29 April 2021, 08:06

  • ID
    984966
  • Project
    Metabolism of Cities Data Hub
  • Status
    Completed
  • Priority
    Medium
  • Type
    Programming work
  • Assigned to
    Paul Hoekman
  • Subscribers
    Paul Hoekman


Description

The team from Mikkeli reported that shapefile processing was taking a long time. Part of this comes down to file size, but I checked the logs and there is also an issue with the automatic cron that processes these files. So let's look into both at the same time.

Discussion and updates


New task was created


Task was assigned to Paul Hoekman


Status change: Open → In Progress


Carolin (and others), please note that the script to process shapefiles runs every 6 hours on the server. A very large shapefile (think NUTS) can take many hours to process, but any normally sized file takes 10-20 minutes at most. So unless something huge is in the pipeline, we should expect any shapefile to be processed within 8 hours or so. If you check the next day and your shapefile is still PENDING PROCESSING, then something is likely amiss. (NOTE: if a shapefile is too large and requires an admin to lift the block, it will no longer appear in that waiting list, so ignore those.)

First, here is what you can do. If you see this problem, the easiest check is to go to the admin section and click on Django Crons > Cron Job Log. There you can see all the scripts that run automatically in the background (these are called crons). The last column should display green check marks. If you see red X signs instead, it means there is a programmatic error in the processing script. In all likelihood this error will persist until it is resolved, and no new shapefiles will be processed until that fix is in place.

So, if you see this, sound the alarm! Just create a task for me and I will get on it. You can even open the cron job and check the output: the system will provide some cryptic details that are very helpful for debugging the problem.
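The check above boils down to scanning the cron log for failed runs. A hedged, plain-Python sketch of that logic (the field names here are illustrative, not the real django-cron schema):

```python
# Illustrative sketch only: scan log entries and surface the failed runs,
# i.e. the rows that would show a red X in the admin.

def failing_jobs(log_entries):
    """Return the entries that would show a red X in the admin."""
    return [entry for entry in log_entries if not entry["is_success"]]

# Hypothetical log entries; the error message mirrors the one found below.
log = [
    {"job": "process_shapefiles", "is_success": True, "message": "OK"},
    {"job": "process_shapefiles", "is_success": False,
     "message": "django.db.utils.DataError: Geometry has Z dimension "
                "but column does not"},
]

failures = failing_jobs(log)
```

The "message" field is the cryptic output mentioned above; copying it into the task is exactly what makes debugging possible.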

In this task I will outline what I am doing, and I will also try to make reviewing and monitoring this easier by documenting it better (hence this explanation).


Okay, I'm attaching the error from the cron job. The main error is generally visible at the end of the whole debug string. Here it is:

django.db.utils.DataError: Geometry has Z dimension but column does not

Now, this is something I remember from a past bug. Digging into our models file, I found the previous issue: see line 1442 in models.py. It turns out I even added some comments:

> This type has a "Z" geometry which needs to be changed to a 2-dimensional geometry
> See also https://stackoverflow.com/questions/35851577/strip-z-dimension-on-geodjango-force-2d-geometry

So the problem is that there is a Z dimension inside the shapefile. It turns out .shp files support 3- and 4-dimensional coordinates through the Z and M dimensions. However, when these are present they cause a problem when recording the geometry in our db, so we need to strip them. I have done that for a certain type of shapefile, but I suspect the culprit file is one that is processed differently. So, the search is on...
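The fix from the StackOverflow link above amounts to rebuilding the geometry from 2D coordinates. A minimal sketch in plain Python on coordinate lists (not actual GEOS/GeoDjango code):

```python
# Illustrative sketch: drop the Z (and M) values by keeping only (x, y)
# from each coordinate tuple, then rebuild the geometry from the result.

def strip_z(coords):
    """Keep only (x, y) from each coordinate tuple."""
    return [tuple(point[:2]) for point in coords]

# A hypothetical 3D line segment: x (lng), y (lat), z (elevation)
line_3d = [(27.2721, 61.6886, 88.0), (27.2735, 61.6890, 89.5)]
line_2d = strip_z(line_3d)  # now 2D, safe for a 2D db column
```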


Continuing to document my journey...

It is always a bit tricky to see which file will be processed next. I have to download the entire db (which is growing absolutely huge at this point) and then run it locally. Quite a pain. To simplify things, I ran this query directly online:

SELECT "core_record"."id", "core_record"."name"
FROM "core_libraryitem"
INNER JOIN "core_record" ON ("core_libraryitem"."record_ptr_id" = "core_record"."id")
INNER JOIN "core_libraryitemtype" ON ("core_libraryitem"."type_id" = "core_libraryitemtype"."id")
WHERE (NOT "core_record"."is_deleted"
  AND "core_record"."is_public"
  AND ("core_record"."meta_data"->>'ready_for_processing') = 'true'
  AND "core_libraryitemtype"."name" = 'Shapefile')
ORDER BY "core_libraryitem"."year" DESC, "core_record"."name" ASC;

Note: this is the .query output generated from the Django statement, with some manual edits because the JSONField querying syntax for some reason does not output valid SQL (?). Anyway, documenting this for my own sake in case I need it in the future.
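The ordering clause is what decides which file the cron picks up next. A hedged sketch of that queue order in plain Python, assuming the processor takes files in the same order as the query (year descending, then name ascending); all entries besides the Mikkeli railways file are made up:

```python
# Illustrative queue ordering: mirrors ORDER BY year DESC, name ASC.
pending = [
    {"name": "Railways in Mikkeli (multitrack network)", "year": 2021},
    {"name": "Water network", "year": 2021},
    {"name": "Administrative borders", "year": 2020},
]

# Negate the year to get descending order while names stay ascending.
queue = sorted(pending, key=lambda r: (-r["year"], r["name"]))
next_up = queue[0]["name"]  # the file the cron will pick up first
```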

And... what came out of this? See the attached screenshot. It turns out the very same "Railways in Mikkeli (multitrack network)" file we were talking about earlier (because it is so big) is the first file on the list and is thus causing this issue! It starts to make sense: this explains why the file is so large (the shapefile contains elevation coordinates in addition to the lat and lng!).

The good news is that if the users are able to remove this themselves, then this problem will be sorted just like that. I will in any case let them know about this Z dimension in their shapefile. We do need a more permanent fix so I will also keep looking into the code, but I will continue with that later.


Okay, good news... the file in question was removed/fixed by the data processor and everything is working again! The cron is back in the green. This doesn't fix the underlying issue, but it removes the urgency. I'm going to try to fix this properly at some point; at least it is all documented here so it can be picked up again...


So... I figured let's try to solve this once and for all. I replicated the problem locally using the 3D shapefile in question. In this particular case the shapefile consisted of LineStrings (it was a railroad shapefile) that were set to be merged into a single referencespace, so I am using the union() function to merge the different geometries. It turns out I did add removal of the Z dimension at every step of the way, but for some reason it did not work in this particular case. It seems a MultiLineString (which is what the merged geometry ends up being) cannot be stripped of its third dimension; not sure why or how. Anyway, it was sorted by adding a final check before saving that looks at the geometry's "hasz" property and records a user-friendly error without corrupting the whole run.
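A sketch of what that final pre-save guard might look like, assuming a GEOS-like geometry object exposing a `hasz` attribute (as GeoDjango geometries do); the error class and names are illustrative, not the actual implementation:

```python
# Illustrative guard: fail gracefully with a user-friendly error instead
# of letting a 3D geometry corrupt the whole processing run.

class ShapefileError(Exception):
    """Recorded as a friendly error message for the data processor."""

def guard_2d(geom, source_name):
    """Raise if the (merged) geometry still carries a Z dimension."""
    if getattr(geom, "hasz", False):
        raise ShapefileError(
            f"{source_name}: merged geometry still has a Z dimension; "
            "please re-export the shapefile as 2D."
        )
    return geom

# Minimal stand-in for a geometry, for demonstration only:
class FakeGeometry:
    def __init__(self, hasz):
        self.hasz = hasz
```

The point of the guard is that one bad file now produces one recorded error, while the rest of the queue keeps processing.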

For now, this solves the problem. In the future we could also revisit this: apparently we can set dim=3 on the db column and simply support Z dimensions. I didn't want to do this because I don't yet see any use for storing this third dimension. If we find a solid use case, let's activate it, but for now let's keep shapefiles smaller and strip it out.
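Should a use case appear, the change would look roughly like this; the model and field names are illustrative (not our actual schema), and `dim=3` is the documented GeoDjango option for storing 3D geometries:

```python
# Illustrative model fragment only, not our actual models.py.
from django.contrib.gis.db import models

class ReferenceSpace(models.Model):
    # dim=3 makes the column accept geometries that carry a Z dimension
    geometry = models.GeometryField(dim=3)
```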

Lastly, I added a page to the control panel (of all STAF subsites) called PENDING SHAPEFILES, where anyone can check which shapefiles are pending plus which shapefiles have returned errors. It's nice to have this global overview readily at hand. Screenshot attached.


Status change: In Progress → Completed