
Check the success of the initial terms migration (does it have holes)
Closed, Resolved · Public

Description

During analysis of T237984 we determined that the maintenance script migrating the data may have hit some problems and left data missing for some properties.
We think the same thing probably happened to some ranges of items too.

We should check whether there are holes in the new tables and re-migrate these entities if needed.

Initially we will try to do these checks in Hadoop (see sub-ticket).

Event Timeline

Using the sqooped tables..

Looking at the first 10 million items

// Find item IDs that appear in the old wb_terms table but not in the new wbt_item_terms table
spark.sql("""
SELECT
  term_entity_id as old,
  wbit_item_id as new
FROM (

SELECT
  DISTINCT term_entity_id
FROM joal.wikibase_wb_terms
WHERE wiki_db = 'wikidatawiki'
  AND snapshot = '2019-10'
  AND term_entity_id > 0
  AND term_entity_id < 10000000
  AND term_entity_type = 'item'

) as old
LEFT JOIN (

SELECT
  DISTINCT wbit_item_id
FROM joal.wikibase_wbt_item_terms
WHERE wiki_db = 'wikidatawiki'
  AND snapshot = '2019-10'
  AND wbit_item_id > 0
  AND wbit_item_id < 10000000

) as new ON term_entity_id = wbit_item_id
WHERE wbit_item_id IS NULL
""").repartition(64).createOrReplaceTempView("wd_comparison_2")
spark.table("wd_comparison_2").cache()

I find there are 14,054 item IDs that seem to appear in the old table but not in the new ones.

I'll generate a list of what we need to run over
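For reference, counting the gaps and dumping the IDs from that temp view can be done roughly like this (a minimal sketch in the spark shell; the output path is just an illustration, not the actual location used):

// Count the missing items and write the IDs out as CSV
// (the output path below is hypothetical, for illustration only)
val missing = spark.table("wd_comparison_2")
println(missing.count())

missing
  .select("old")
  .orderBy("old")
  .coalesce(1)
  .write
  .mode("overwrite")
  .csv("/user/addshore/T239470-item-terms-migration-holes")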

Looks like there are 2,709,497 missing items when comparing item IDs 0 to 70,000,000.

Dump of those IDs coming soon!

I ran the check across all item IDs as of a few hours ago and output a file of integer item IDs.

The output can be found at https://analytics.wikimedia.org/published/datasets/one-off/wikidata/addshore/T239470-20191129-1530-item-terms-migration-holes (WARNING: 23MB)
2,709,498 lines, each of which is an item we should re-run against.

> Looks like there are 2,709,497 missing items when comparing item IDs 0 to 70,000,000.
> Dump of those IDs coming soon!

Wait, the script only ran up to Q46Mio, not Q70Mio.

Item 70 million is in the tables...

mysql:[email protected] [wikidatawiki]> select count(*) from wbt_item_terms where wbit_item_id = 70000000;
+----------+
| count(*) |
+----------+
|       48 |
+----------+
1 row in set (0.00 sec)

We appear to be missing something?

I wrote a bunch of queries in a notebook and have been running them throughout the last week, tweaking and refining them.
The notebook in my user space for future reference is "Wikidata - New and Old wb_terms comparison (looking for holes)"
I thought I was getting bogus numbers, but apparently not.

The queries that I ran identified a lot of items in the new tables that appear to have something wrong. (Exact numbers and a list of items will follow once the last queries complete.)

Such as https://www.wikidata.org/wiki/Q10000002

When looking at this item in the new store I could see P9835
When looking at the wb_terms table, all looked good

I then made an edit https://www.wikidata.org/w/index.php?title=Q10000002&diff=1072787715&oldid=1035194061 to trigger some writes to the term store

And the new tables looked to be in a much better state: P9836
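Note that the presence/absence join earlier only catches items with no rows at all in the new store. One heuristic for catching items like this one, which are present but incomplete, is to compare per-item row counts between the two sqooped tables. A rough sketch (not necessarily the exact query used in the notebook; the view name is illustrative):

// Items present in both stores but with fewer rows in the new store than
// in old wb_terms; a heuristic only
spark.sql("""
SELECT o.id, o.old_rows, n.new_rows
FROM (
  SELECT term_entity_id AS id, COUNT(*) AS old_rows
  FROM joal.wikibase_wb_terms
  WHERE wiki_db = 'wikidatawiki'
    AND snapshot = '2019-10'
    AND term_entity_type = 'item'
  GROUP BY term_entity_id
) AS o
JOIN (
  SELECT wbit_item_id AS id, COUNT(*) AS new_rows
  FROM joal.wikibase_wbt_item_terms
  WHERE wiki_db = 'wikidatawiki'
    AND snapshot = '2019-10'
  GROUP BY wbit_item_id
) AS n ON o.id = n.id
WHERE n.new_rows < o.old_rows
""").createOrReplaceTempView("wd_suspect_items")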

Mentioned in SAL (#wikimedia-operations) [2019-12-09T10:46:10Z] <addshore> T239470 addshore@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki wikidatawiki --from-id=10000007 --to-id=10000007

I ran the rebuild script over one of the bugged items, and this correctly fixed the entries in the new store.
Is this probably all left over stuff from the deadlocks issue?

@Ladsgroup where did the re-run of the rebuild script start from again?
Maybe we should have a targeted run now against all items returned by the hadoop query? (once I post that list here)

> I ran the rebuild script over one of the bugged items, and this correctly fixed the entries in the new store.

🎉 not-surprising yet great news :)

> Is this probably all left over stuff from the deadlocks issue?

Yes, I'm guessing a combination of failures due to deadlocks and the not-rethrown DB exception that was being caught (removed in https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/549864) is a very plausible theory for what happened.

> Maybe we should have a targeted run now against all items returned by the hadoop query? (once I post that list here)

Sounds like a good plan to me
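For such a targeted run, the ID list would need to be fed to rebuildItemTerms.php somehow; the invocation in the SAL entry above only shows --from-id / --to-id, so one option is to first collapse the sorted ID list into contiguous ranges. A hypothetical helper for that (the function is made up, not part of any existing tooling):

// Collapse a list of item IDs into contiguous (from, to) ranges,
// suitable for repeated --from-id / --to-id invocations
def toRanges(ids: Seq[Long]): Seq[(Long, Long)] =
  ids.sorted.foldLeft(List.empty[(Long, Long)]) {
    case ((from, to) :: rest, id) if id == to + 1 => (from, id) :: rest
    case (acc, id)                                => (id, id) :: acc
  }.reverse

// e.g. toRanges(Seq(5, 1, 2, 3, 7)) == Seq((1, 3), (5, 5), (7, 7))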

So, these snapshots were taken roughly 2 weeks ago now, and the number of items that appear to have issues is 49,021,987 (a non-trivial number…).
There will always be some false positives here, as the snapshots of the tables are taken in series, so the joins will be off for any items edited between the snapshot of one table and the snapshot of the other.

List of IDs for this will be at https://analytics.wikimedia.org/published/datasets/one-off/wikidata/addshore/T239470-20191209-1225-item-terms-migration-holes

I'll try and publish the notebook somehow too, but I have some questions to analytics before I do that :)

I ran some checks on your numbers. It seems some ranges are fairly clean, with only ~30% of items having issues, while in other ranges it is more than 95%. These are the numbers broken down by millions of item IDs; 1 means up to Q1Mio (a sketch of how such a breakdown could be produced follows the list):

1 364510
2 415854
3 380372
4 302745
5 356028
6 326523
7 444421
8 404898
9 821285
10 590021
11 475503
12 218112
13 256970
14 384180
15 439502
16 639781
17 320441
18 409604
19 330605
20 201857
21 449418
22 426350
23 582121
24 321645
25 443846
26 462407
27 350897
28 446790
29 488908
30 322142
31 620204
32 364034
33 659610
34 815488
35 784804
36 642381
37 950938
38 912740
39 920671
40 950641
41 948308
42 807250
43 793387
44 849349
45 928451
46 644449
47 910284
48 534375
49 36997
50 397077
51 875785
52 923620
53 819251
54 607872
55 559604
56 751158
57 851437
58 915506
59 945442
60 905496
61 861103
62 841255
63 811589
64 700324
65 841526
66 831120
67 705046
68 658960
69 811256
70 620628
71 687696
72 885236
73 883354
74 878387
75 800494
76 986947
77 976292
78 236429
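A sketch of how such a per-million breakdown could be produced, assuming the affected IDs have been loaded into a temp view called affected_items with a single item_id column (both names are illustrative, not necessarily what was actually run):

// Bucket the affected item IDs by millions (bucket 1 = Q1..Q1000000)
spark.sql("""
SELECT
  FLOOR((item_id - 1) / 1000000) + 1 AS million_bucket,
  COUNT(*) AS affected
FROM affected_items
GROUP BY FLOOR((item_id - 1) / 1000000) + 1
ORDER BY million_bucket
""").show(100)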

Up to Q50Mio, 26Mio items have issues. I might be wrong, but this might mean we need to run the migration practically all over again.

I guess that is mainly down to the data being 2 weeks old, plus the constant stream of edits fixing the issue and the maintenance script still running.

Addshore renamed this task from Check the success of the terms migration (does it have holes) to Check the success of the initial terms migration (does it have holes). Dec 10 2019, 2:49 PM
Addshore closed this task as Resolved.

We decided in the daily today to scope this task down to the initial migration run.
If we need to do it again we will create another task!