New task was created
MoC and CL Scoreboards down
Created on Wednesday 17 February 2021, 16:27Back to task list
Assigned toNo one yet
SubscribersAristide AthanassiadisCarolin BellstedtPaul Hoekman
Another trouble shooting task: the scoreboard is down. (How would I know? Who was checking? Not me, never 😇😝) They are both taking a long time trying to attempt to load, before the MoC forever loading screen pops up.
Btw, maybe we need a trouble shoot tag for the tasks? More tags, yay :)
Discussion and updates
Task was assigned to Paul Hoekman
Yeah I know about this one. It's happened before and it's a very odd server issue. Basically the CPU goes through the roof. Previously only on the server, now also on my development machine (note to self: CPU to 8.70 until I force quit docker), whenever I open this page. I'd done some digging before and I thought I had disabled this page/query to solve this problem. Well, "solve", because obviously we want this list but we can't have the server hang every time the page is opened. Let me see if I can find my previous notes so I at least have them here, and I gotta do further debugging and may have to put a temporary solution in place until we solve this long-term...
Just for future use here are my notes:
- Postgre seems to overload and this is a useful page for debugging
- I have this odd feeling that it's some sort of a bug in Django/Postgres as I feel that the CPU going out of control is not what should happen. The query can time out sure, but without these crazy CPU loads.
- Useful code snippets below
Check current queries:
SELECT pid, datname, usename, query, xact_start FROM pg_stat_activity;
Check if any is up for more than a few sec:
SELECT max(now() - xact_start) FROM pg_stat_activity
WHERE state IN ('idle in transaction', 'active');
Export them for analysis:
COPY (SELECT * FROM pg_stat_activity) TO '/var/lib/postgresql/activity.csv' WITH (format csv, header);
The query that seems to be giving the issue on the server:
SELECT DISTINCT "core_record"."id", "core_record"."name", "core_record"."description", "core_record"."description_html", "core_record"."image", "core_record"."date_created", "core_record"."is_deleted", "core_record"."is_public", "core_record"."old_id", "core_record"."meta_data", "core_people"."record_ptr_id", "core_people"."firstname", "core_people"."lastname", "core_people"."affiliation", "core_people"."email", "core_people"."email_public", "core_people"."website", "core_people"."twitter", "core_people"."google_scholar", "core_people"."orcid", "core_people"."researchgate", "core_people"."linkedin", "core_people"."research_interests", "core_people"."status", "core_people"."user_id", "auth_user"."date_joined" FROM "core_people" INNER JOIN "core_record" ON ("core_people"."record_ptr_id" = "core_record"."id") INNER JOIN "core_message" ON ("core_people"."record_ptr_id" = "core_message"."posted_by_id") INNER JOIN "auth_user" ON ("core_people"."user_id" = "auth_user"."id") INNER JOIN "core_message" T5 ON ("core_people"."record_ptr_id" = T5."posted_by_id") INNER JOIN "core_record" T6 ON (T5."parent_id" = T6."id") LEFT OUTER JOIN "core_forumtopic" ON (T6."id" = "core_forumtopic"."record_ptr_id") LEFT OUTER JOIN "core_work" ON (T6."id" = "core_work"."record_ptr_id") WHERE (NOT "core_record"."is_deleted" AND "core_record"."is_public" AND "core_message"."record_ptr_id" IS NOT NULL AND "core_people"."user_id" IS NOT NULL AND ("core_forumtopic"."part_of_project_id" = 6 OR "core_work"."part_of_project_id" = 6)) ORDER BY "auth_user"."date_joined" DESC LIMIT 21
- This stuff is behaving irregular. Previously, the server CPU load would flame up whenever the page was open AND STAY UP, with this query hanging for ages (I've seen the CPU load being high for HOURS, which I captured way too late before).
- Currently the server CPU load doesn't go out of control... the query seems to "properly" time out and that's it.
- Locally however it does spin out of control (version issue??) making debugging difficult.
- Online one scoreboard works (albeit slow), the other one doesn't. Locally it's the opposite. You can't make this stuff up.
Here are four EXPLAIN queries I ran, putting them here for easy copying (attached SQL). Results showed that:
- Execution time was way higher than planning time... 1939 vs 11 ms
- Huge expense at the first UNIQUE and SORT
- After removing DISTINCT we get 167 ms
- After removing DISTINCT and SORT we get 0.9 (HOLY ....)
- After removing just SORT we get 1907
So at first sight it seems that the DISTINCT is the huge time sink (makes sense, that's generally expensive stuff). But without this we get duplicated people. We can get around the SORT by doing that with datatables in the front end, but distinct must stay (or just filter out with a condition in the template?? seems insane). I think if we can recraft this query to do a different type of JOIN we can get rid of the duplicates. But the entire query is bananas as it is, and it would take time to figure this out - likely requires a raw query, not sure how to do other joins in Django.
Also for the huge discrepancy between planning & execution see this thread. May be worth looking into (but that is not the root cause I believe).
-----> Carolin all of these are NOTES TO SELF so I have this stuff in one place as I dig into this. I'm not sure how long it's going to take to solve this, it's quite a hairy situation. What do you want me to do for now? Disable it?? It sometimes seems to work for some of the subsites...
So, this gets even more bizarre. I just wanted to see what could be done by removing the sort and distinct. In core/views:
#list = People.objects.filter(message_list__isnull=False, user__isnull=False).distinct().order_by("-user__date_joined")
list = People.objects.filter(message_list__isnull=False, user__isnull=False)
Sounds good no? Should now be fast and easy. Not so much. The page that PREVIOUSLY opened well on my local machine with the SORT and DISTINCT in place would NOT open anymore. CPU slowly crept up to 3 and I had to abort mission. I'd tried to open the page multiple times and was looking at the SQL activity log, where is now this point counting query that seems to be hanging. But that's not all. This same pointing counting script gets swapped out for a query to the People table as I refresh the page. And back to point counting. BIZARRE. To add to the mystery, it's always the same person whose points + profile are being opened (Carolin!). I think the first record in the queryset. But why WHY is this happening? The EXPLAINS of these queries are totally normal, super low execution times. Furthermore, after I swapped out the script and now sort and distinct the query, it WORKS AGAIN. No other changes. What.The.(*@#HJ.
I'm thinking it may be best to store the points in the user's meta data and simply calculate it every time any work is completed. It will circumvent this whole issue and in general make the point querying system MUCH faster. However it's not as clean and I foresee more code complexity to ensure we always update etc. But likely it's the way to go. Especially with all this craziness. Anyway attaching the relevant queries that make the system hang, in case I want to try and replicate this later.
I'm gonna leave it at this for now. This is enough bizaro SQL debugging festivities for one evening.
Thanks for going through the "SQL debugging festivities"!! Since you asked me what to do, in my opinion, you don't have to disable it. I will tell the cities today in the course that we know of this issue and are trying to fix it. Not sure if anyone is checking it anyway...
I loved reading this, didn't understand a bit, and you sounded like a mad scientist but thanks for sharing what is happening behind the curtain
Haha OK great you enjoy this reading Aris ;-) And noted Carolin, we'll leave it where it is and I have this on my radar so at some point I'll come back to this...
I am stepping down from my MoC roles but if another programmer wants to take a crack at this then the debugging info is all here.
Paul Hoekman is no longer responsible for this task