Sean.hoyland
This user talk page might be watched by friendly talk page stalkers, which means that someone other than me might reply to your query. Their input is welcome and their help with messages that I cannot reply to quickly is appreciated.
Archives
This page has archives. Sections older than 5 days may be automatically archived by Lowercase sigmabot III.
Looking for a tool
Page intersections
...that does the same as
but also gives you the number of pages you have edited in common. Does that exist? (Just looking for an easy way to find how many pages Icewhiz and I have in common.) Huldra (talk) 23:18, 16 November 2024 (UTC)
- Not that I'm aware of, and I'm not sure of the accuracy of the edit counts produced by editorinteract.py. For example, look at you vs Galamore for 1929 Hebron massacre. It says you=2, Galamore=1. But you made 5 edits to that article, and it should know that, because the linked sigma timeline.py says so, as does sigma usersearch.py.
- Anyway, if you are just after something quick to see the number of pages in common, you can do something like the query below.
select
    convert(replace(p.page_title, '_', ' ') using utf8mb4) page_title,
    p.page_namespace,
    p.page_is_redirect,
    p2.rev_count 'Huldra+Icewhiz rev_count'
from page p
join (
    -- pages that both accounts have edited, with their combined revision count
    select ru.rev_page, count(ru.rev_id) as rev_count
    from revision_userindex ru
    join actor_revision ar on ar.actor_id = ru.rev_actor
    where ar.actor_name in ('Huldra', 'Icewhiz')
    group by ru.rev_page
    having count(distinct ar.actor_id) = 2
) p2 on p2.rev_page = p.page_id
order by 1, 2, 3
- You can use Quarry - see here -> Resultset (598 rows) Sean.hoyland (talk) 05:50, 17 November 2024 (UTC)
- To editor Huldra: Enter "enwiki_p" into the little box on the left above the table, copy the code into the black area, hit "Submit Query". I'm obviously more evil than you because I get 639 rows in common with Icewhiz. But you and I have 2719 in common. Zerotalk 12:49, 17 November 2024 (UTC)
- Strictly speaking, for intersections with Icewhiz, you would need to combine the results for 48 accounts, but I keep putting off thinking about article intersection information as it's another rabbit hole. It would be nice to be able to quantify the likelihood of intersections.
- ['007Леони́д', '11Fox11', 'AnnieGrannyBunny', 'Astral Leap', 'AstuteRed', 'Bob not snob', 'DoraExp', 'Double barrel pistol with both opposite direction', 'EnfantDeLaVille', 'Eostrix', 'Free1Soul', 'Galamore', 'Geshem Bracha', 'Herpetogenesis', 'Hippeus', 'I dream of Maple', 'Icewhiz', 'Jacinda01', 'JoeZ451', 'Just Prancing', 'KasiaNhersL', 'LeftDreams', 'ManoelWild', 'Minden500', 'Molave Quinta', 'Mrboondocks', 'Mvqr', 'O.maximov', 'OdNahlawi', 'PeleYoetz', 'Pikavoom', 'PRL Dreams', 'Proud Indian Arnab', 'Purski', 'RCatesby', 'SCNBAH', 'Seggallion', 'Semper honestus', 'Smoking Ethel', 'SunSun753457', 'Świst lodu', 'Szymon Frank', 'The 2nd coming of Purski', 'UnspokenPassion', 'Uppagus', 'VikingDrummer', 'WhizICE', 'Терпение не ненавижу']
- Sean.hoyland (talk) 13:10, 17 November 2024 (UTC)
- ok, thanks, Sean, much appreciated. Obviously, both Zero and I have been conspiring with Icewhiz offline.
- However, I am not sure I trust those numbers either; why are some articles mentioned once, and others twice? E.g., 1917 Jaffa deportation is mentioned twice, while 1929 Hebron massacre is mentioned once? cheers, Huldra (talk) 20:41, 17 November 2024 (UTC)
- @Huldra: It's because different namespaces are listed separately (see the page_namespace column). 0=article, 1=talk, 2=user, 3=user talk, 4=WP, 5=WT, 10=template, 11=template talk, etc. So the counts are of "pages", not of "articles". Zerotalk 00:08, 18 November 2024 (UTC)
- Huldra+11Fox11=153, Huldra+Galamore=98, Huldra+Geshem Bracha=100, Huldra+PeleYoetz=88, Huldra+Pikavoom=325(!), etc., but there are page overlaps so it isn't correct to add these numbers. Sean, how to test this list of socks all at once? Zerotalk 02:13, 18 November 2024 (UTC)
- Huldra, if you prefer the namespace numbers as descriptions, you can translate them like this (see the sketch below). Zero0000, I wouldn't do that with SQL. It's easier to handle the data using Pandas or Polars. I'll have a look a bit later. It's a question I've been avoiding or letting marinate for months for various reasons, e.g. it's not clear to me how to extract useful information, something like the significance of an intersection, from article intersections. For a start, I don't know what the intersection statistics for a large set of random users with various account ages, editcounts, interests etc. look like, the background stats, so how can I know whether something is significant or not? It seems possible, in principle at least, to write down a function that estimates the improbability of an intersection based on page revision counts, page unique editor counts, editor editcounts, account age, and various other things that I haven't figured out yet. Sean.hoyland (talk) 04:41, 18 November 2024 (UTC)
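A minimal sketch of that kind of translation, as a case expression that could replace the page_namespace column in the intersection query above (only a few namespaces are mapped here; the numbers follow Zero's list):
case p.page_namespace
    when 0 then 'Article'
    when 1 then 'Talk'
    when 2 then 'User'
    when 3 then 'User talk'
    when 4 then 'Wikipedia'
    when 5 then 'Wikipedia talk'
    when 10 then 'Template'
    when 11 then 'Template talk'
    else cast(p.page_namespace as char)
end as namespace_name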
- Huldra, unfortunately the evidence suggests that the extent of your conspiracy with Icewhiz and their socks is even more concerning, spanning 1438 pages, a number that is apparently 'extraordinarily high'. There are 3 google sheets here.
- A. df_page_intersect - lists editcounts for page intersections between you and each of Icewhiz's accounts.
- B. df_page_intersect_sum - same as A but with Icewhiz+socks editcounts summed.
- C. df_page_intersect_sock_sum_pivot - pivoted version of B
- Zero, I'll see if there is a way to do this only using SQL that isn't horrendously ugly. Sean.hoyland (talk) 10:40, 18 November 2024 (UTC)
- 1438 pages! I am clutching my pearls, hoping nobody finds out! (I thought of uploading a picture of me, clutching my pearls, but unfortunately my android phone doesn't communicate easily with my Mac) Luckily, those google sheets you linked to are proof of Selfstudier's work with Icewhiz, not mine. cheers, Huldra (talk) 22:08, 18 November 2024 (UTC)
- I was thinking: select pages edited by any of ('Huldra', 'Icewhiz', 'sock', ...), then of those select the pages with Huldra and a distinct actor count of at least 2. But I didn't get it to work. The purpose is to dispel the idea that such page intersections prove anything other than similar interests, and I agree that an objective measure of how significant a number is is probably difficult or impossible to find. A more direct measure of working together might be to count how many times A restored an edit of B that had been reverted and vice-versa. But that seems tricky as well as hard to calibrate. Zerotalk 11:02, 18 November 2024 (UTC)
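One possible shape for that two-step idea, as an untested sketch (the three names here are stand-ins for the full account list):
select p2.rev_page, p2.rev_count
from (
    select ru.rev_page,
        count(ru.rev_id) as rev_count,
        count(distinct ar.actor_name) as actor_count,
        max(ar.actor_name = 'Huldra') as has_huldra -- 1 if Huldra edited the page
    from revision_userindex ru
    join actor_revision ar on ar.actor_id = ru.rev_actor
    where ar.actor_name in ('Huldra', 'Icewhiz', 'Galamore') -- extend with the rest of the socks
    group by ru.rev_page
) p2
where p2.has_huldra = 1 and p2.actor_count >= 2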
- For me, the interest in intersections predates things like the Piratewires nonsense. It stems from two opposites: 1) an SPI case years ago where a highly improbable intersection played a successful role, and 2) an SPI case where a checkuser request was declined because the large number of intersections was regarded as not compelling. The editor had made a lot of edits and many of the pages were high traffic pages, despite many improbable overlaps at pages with low pageviews, revision counts etc. Seems like a clue that somewhere in between those 2 outcomes is a better way to extract, integrate and present page intersection evidence. As for the SQL, even getting all of an account's socks is tricky because the category graphs for sockmasters are almost always incomplete, e.g. categories only get 39 of the Icewhiz sock accounts (see the sketch below). Sean.hoyland (talk) 12:54, 18 November 2024 (UTC)
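For reference, the category route looks something like this, a sketch assuming the usual enwiki sock category naming convention (and, as noted above, it only finds a subset of the accounts):
select replace(p.page_title, '_', ' ') as sock_account
from page p
join categorylinks cl on cl.cl_from = p.page_id
where cl.cl_to = 'Wikipedia_sockpuppets_of_Icewhiz'
    and p.page_namespace = 2 -- User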
- How do I find the extent of my conspiracy with Icewhiz and their socks? :) Selfstudier (talk) 10:48, 18 November 2024 (UTC)
- To editor Selfstudier: Good try, we all know you are Icewhiz. Zerotalk 11:04, 18 November 2024 (UTC)
- Well, step one is probably me fixing the bit where I forgot to add usernames to the output so you can tell who is being compared to who (whom?). Sean.hoyland (talk) 11:07, 18 November 2024 (UTC)
- Step two, results for you are in there now. Only 542 pages, an 'incredible abundance' obviously, but not 'extraordinarily high'...probably. Sean.hoyland (talk) 11:27, 18 November 2024 (UTC)
- Some memories there, thanks. Selfstudier (talk) 11:32, 18 November 2024 (UTC)
<- Zero0000, here are 2 versions of SQL that produce pivoted intersection results.
- version A cheats by hard-coding in the 48 Icewhiz accounts. It's fast (9s), but I think it is missing some results and I don't know why yet.
- version B does it properly by actually selecting all of the accounts. It seems to produce a complete set of results but it's slow (650s), even though the sock selection part is okay...searching block comments always takes a while. Something about the structure is causing the server to use an inefficient execution plan I guess. Not sure what to do about that. Sean.hoyland (talk) 18:19, 18 November 2024 (UTC)
- I am reading the SQL article, in order to understand what you guys are doing. Unfortunately, it looks like a typical Wikipedia science article: difficult for anyone not knowing what it is about. (But presumably quite clear to those who already know SQL ;/), oh well, Huldra (talk) 22:16, 18 November 2024 (UTC)
- SQL was designed by an evil demon. Incidentally Piotrus gets a score of 1942, more evil than any of us. Zerotalk 01:56, 19 November 2024 (UTC)
- This is where all that apparently useless stuff at school about Venn diagrams and set symbols finally has a chance to be useful. SQL just builds sets of things then connects them together like making your own Lego bricks and building something. But this is one area where the LLMs like Claude and ChatGPT really shine because they have looked at millions of lines of SQL. I almost never document any code I write in any language because I'm not a software engineer. It's boring. And I have 'eternal sunshine of the spotless mind' when it comes to code I write. One week is enough time for me to completely forget almost everything about a piece of code and think 'who wrote this garbage?'. Now you can outsource the explaining/documentation to AIs. They're very good at it, much better than their ability to write code. Try putting the SQL in Claude for example and asking it to explain it. Sean.hoyland (talk) 05:13, 19 November 2024 (UTC)
Zero0000, Huldra, I've put a generic-ish query in Quarry.
- version C - you can just specify a reference account name and a comma separated list of one or more other accounts to compare it with.
Sean.hoyland (talk) 16:02, 19 November 2024 (UTC)
- Sean.hoyland: much appreciated, Huldra (talk) 21:06, 19 November 2024 (UTC)
The slow performance of the server for a query where you only need to specify the sockmaster rather than tediously hardcode the sock list was too annoying for me to let go. I've added another version that seems to persuade the server to use a decent plan e.g. 120s vs 650s.
- version D - 'Page intersections between a reference actor and a ban evasion source and their socks' - only need to specify reference actor e.g. Huldra and sockmaster e.g. Icewhiz.
Sean.hoyland (talk) 11:23, 20 November 2024 (UTC)
Revision counting
Putting some stuff here. What do you think of my way to guess at the number of reverts that are not ECR reverts? Another thing: the number of revisions marked "mw-reverted" is significantly higher than the number marked "mw-undo" or "mw-rollback". This suggests that the way reverts are detected has more cases. Do you know where it is described? Zerotalk 12:14, 20 November 2024 (UTC)
- I'll have a look and get back to you. But for the second question, I noticed a manual revert tag in my watchlist for the first time the other day. I assume it's ctd_name='mw-manual-revert', ctd_id=582. Special:Tags says it's "Edits that manually restore the page source to an exact previous state". 4 million revisions have been tagged so I guess that tag might account for some of the cases. Sean.hoyland (talk) 12:36, 20 November 2024 (UTC)
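A quick way to compare the raw site-wide tag totals, a sketch that reads the maintained per-tag counts rather than joining change_tag to revision:
select ctd_name, ctd_count
from change_tag_def
where ctd_name in ('mw-reverted', 'mw-undo', 'mw-rollback', 'mw-manual-revert')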
- The not-ECR guess seems like it might be error-prone. The PIA 0 and 1 namespaces probably have quite different protection statistics, especially high traffic pages (I haven't checked). That might mess up the assumption. It's possible to search for ECR-related strings in the revision comments, but it is likely to be slow and incomplete. It might be worth trying. I see what you mean about the tag stats. Not sure what is going on there. There might be an explanation buried somewhere if you start at Wikipedia:Tags. There's also 'app-undo' = undo actions made from the mobile apps. But there are only 26 revisions with that tag in PIA (0,1) namespaces between 2020-10-07 and 2024-10-06. Sean.hoyland (talk) 17:40, 20 November 2024 (UTC)
- Zero0000, the counts should probably exclude bot revisions, shouldn't they? If you want to do that, you can add something like the following to the where clause or the join. It's tempting to write 'x not in (...)', but it wouldn't work in this case because actor_user can be null (for IPs), so a 'not in' would miss the unregistered/non-user actors and their revisions.
actor_revision.actor_user in (
    select ug.ug_user
    from user_groups ug
    where ug_group = 'bot'
) is not true
- For 2020-10-07 to 2021-10-06 that would change the edit count from 96687 to 84775 for example.
- As for the mysterious tagging, looking at one page might help e.g. Israeli invasion of the Gaza Strip (disambiguation). The bot vs human back-and-forth in recent revisions is interesting e.g. this bot edit wasn't tagged, even though it looks like it should be treated as a revert.
- 2024-06-09T10:02:38 Bot1058 m 272 bytes (−296)
- Sean.hoyland (talk) 07:14, 21 November 2024 (UTC)
- Adding mw-manual-revert to the definition of "revert" does bump up the revert count for recent years. Still not up to the "reverted" count, though. I'm particularly interested in splitting the "reverted" count according to whether the actor was extended-confirmed at the time. Zerotalk 11:20, 21 November 2024 (UTC)
Privilege-based revision labeling
- "...whether the actor was extended-confirmed at the time" sounds straightforward, but in practice it's one of the many things about the wiki system that I haven't quite figured out how to do properly and efficiently. There are also edge cases and I don't know how many there are. Your account is a good example of one of the pitfalls of assuming anything about the system/logging based queries. In theory, you can look for extendedconfirmed grants then use the timestampdiff function to label actors based on a comparison of the EC grant timestamp (or absence) with the revision timestamp in seconds. They could be
- negative (actors granted EC after the revision)
- null (actors still without the EC grant)
- positive (actors granted EC before the revision)
- But sometimes the grant isn't in the log, e.g. for sysops in some cases it seems. So, you can look for the sysop grant too. However, if you look at your account, the only log entry is from 2022, so a query would assume that's when you became extendedconfirmed. I'm not sure what to do about that. Maybe set the registration date as the EC date for those cases, but then you have to join to the user view, which can slow down big queries. Then there is the whole issue of EC sometimes being revoked, where just picking the oldest grant, which is usually an autopromote grant, may not be the right decision. Anyway, the bottom line is that it's possible to label revisions based on the state of the privileges at the time and count them, but query performance is an unresolved issue for me.
- I'll post some example code later. I'm a bit busy today. Sean.hoyland (talk) 04:36, 22 November 2024 (UTC)
- It's good you asked this question because it reminded me to look for a way to avoid looking at the logs. Sean.hoyland (talk) 07:58, 22 November 2024 (UTC)
- It seems there are no rights logs before Dec 2004, but I got sysop before then. Also, I think that revocation of EC is rare enough that we can ignore it. Zerotalk 11:19, 22 November 2024 (UTC)
- "...whether the actor was extended-confirmed at the time" sounds straightforward, but in practice it's one of the many things about the wiki system that I haven't quite figured out how to do properly and efficiently. There are also edge cases and I don't know how many there are. Your account is a good example of one of the pitfalls of assuming anything about the system/logging based queries. In theory, you can look for extendedconfirmed grants then use the timestampdiff function to label actors based on a comparison of the EC grant timestamp (or absence) with the revision timestamp in seconds. They could be
<- Here's an example to illustrate a few things.
- The example is limited to the '2023-10-07' to '2024-10-06' window in PIA for 5 accounts ('Ainty Painty', 'Pave Paws', '24.130.244.5', 'Canterbury Tail', 'Zero0000'), each of which has different properties.
- The grant_timestamp common table expression shows one of many ways to get the timestamp from the log. It uses the 'alternative view' logging_userindex rather than the logging view to try to speed things up. I don't appear to have the privileges needed to just look directly at a user's privileges/roles unfortunately, which might be faster than looking in the logs. There is a way to get grant information without looking in the logs, but weirdly that view doesn't expose the timestamps (a sketch of it follows the query below).
- The select part includes timestampdiff and case lines to illustrate the function usage and one way to do privilege-based Boolean revision labelling.
- Each user illustrates something different.
- Ainty Painty was extendedconfirmed before the revisions so the ec_at_time_of_rev labels are True.
- Pave Paws is not EC so the labels are False.
- 24.130.244.5 is not a registered account so no chance of EC, so the label is False.
- Canterbury Tail started off without EC and was eventually granted the privilege; the labels change from False to True for the last 3 revisions.
- Your account illustrates the issue I mentioned: the log timestamp is 2022-07-23. Still not sure how best to address that issue.
- The query performance seems okay but may not scale linearly. I haven't checked yet. For testing, adding limit 10 or something will help.
- I guess you can use a query like this to count the revisions: get rid of some of the columns and use group by to get label and namespace counts.
with pia_titles as (
    -- talk pages tagged with the PIA ArbCom enforcement / contentious topic templates
    select
        p.page_title
    from linktarget lt
    join templatelinks tl on tl.tl_target_id = lt.lt_id
    join page p on p.page_id = tl.tl_from
    where
        lt.lt_namespace = 10 -- Template
        and lt.lt_title in ("ArbCom_Arab-Israeli_enforcement", "Contentious_topics/Arab-Israeli_talk_notice")
        and page_namespace = 1 and page_is_redirect = 0
    union
    -- talk pages tagged by both the Israel and Palestine WikiProjects
    select
        page_title
    from page
    join categorylinks israel on page_id = israel.cl_from and israel.cl_to = "WikiProject_Israel_articles"
    join categorylinks palestine on page_id = palestine.cl_from and palestine.cl_to = "WikiProject_Palestine_articles"
    where
        page_namespace = 1 and page_is_redirect = 0
),
pia as (
    -- the corresponding pages in the article (0) and talk (1) namespaces
    select
        p.page_id, p.page_title, p.page_namespace
    from pia_titles pt
    join page p on p.page_title = pt.page_title
        and p.page_namespace in (0, 1) and p.page_is_redirect = 0
),
grant_timestamp as (
    -- earliest sysop/extendedconfirmed grant per actor, from the rights log
    select log_actor, min(log_timestamp) log_timestamp
    from logging_userindex
    where log_type = 'rights'
    and log_params rlike '(sysop|extendedconfirmed)'
    group by 1
)
select
    'PIA' area,
    ar.actor_id,
    convert(ar.actor_name using utf8mb4) actor_name,
    gt.log_timestamp ec_timestamp,
    ru.rev_timestamp,
    timestampdiff(second, gt.log_timestamp, ru.rev_timestamp) seconds_ec_to_rev,
    case when timestampdiff(second, gt.log_timestamp, ru.rev_timestamp) > 0 then True else False end ec_at_time_of_rev,
    pia.page_namespace
from actor_revision as ar
join revision_userindex ru on ru.rev_actor = ar.actor_id
join pia on pia.page_id = ru.rev_page
left join grant_timestamp gt on gt.log_actor = ar.actor_id
where
    -- exclude bots; written this way because actor_user can be null for IPs
    ar.actor_user in (
        select ug.ug_user
        from user_groups ug
        where ug_group = 'bot'
    ) is not true
    and date(ru.rev_timestamp) between '2023-10-07' and '2024-10-06'
    and ar.actor_name in ('Ainty Painty', 'Pave Paws', '24.130.244.5', 'Canterbury Tail', 'Zero0000')
order by 5 desc
Sean.hoyland (talk) 11:37, 22 November 2024 (UTC)
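Assuming the timestamp-less view mentioned above is user_groups, the logs-free lookup is a short sketch like this; it only exposes current membership, with no grant timestamps, which is why it can't replace the grant_timestamp CTE:
select ug.ug_group
from user_groups ug
join user u on u.user_id = ug.ug_user
where u.user_name = 'Zero0000'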
Performance seems surprisingly okay, or maybe I got lucky. The wiki database server's performance varies wildly. 106 seconds for the '2020-10-07' to '2021-10-06' window to count the PIA non-bot revisions on the enwiki analytics server from my laptop through an SSH tunnel -> Resultset (108808 rows). 258.48 seconds on Quarry including the row data, so a lot of network time. Sean.hoyland (talk) 13:27, 22 November 2024 (UTC)
- Phew! (Maybe you should split this discussion into sections.) I have found that more than half of the reverted edits in mainspace for 2023-10-07 to 2024-10-06 were made by IPs (no user_name) or users who still have fewer than 500 edits now. So the fraction will be even greater if those who reached 500 after being reverted are included. I'll study your code.
- Grouped PIA revision counts for '2020-10-07' to '2021-10-06' excluding bots, in 121 seconds, I get...
area | ec_at_time_of_rev | page_namespace | rev_count
---- | ----------------- | -------------- | ---------
PIA  | 0                 | 0              | 24778
PIA  | 0                 | 1              | 6005
PIA  | 1                 | 0              | 59995
PIA  | 1                 | 1              | 18030
Sean.hoyland (talk) 13:58, 22 November 2024 (UTC)
Adding category "WikiProject Israel Palestine Collaboration articles" increased the pool a bit. Zerotalk 02:38, 23 November 2024 (UTC)
- Yes, I think the set of things "inside" the topic area that a group of PIA editors might manually construct, given enough time, would probably be much larger than the approximation that has been used. It's possible, I suppose, to build a different set from the category graph starting somewhere, but I have never figured out how to know when to stop following the edges. The set grows pretty rapidly and the boundary between inside and outside seems to be quite wiggly and hard to map. If changing the approximation is an option, there's probably several things that could be tried to populate the set. Sean.hoyland (talk) 03:01, 23 November 2024 (UTC)
Making the list of bots just once saved about 10 seconds for me (but the times vary so much for the same script that it's hard to be sure). Zerotalk
- I'm thinking of the following test for EC: an editor was not EC at the time of an edit if at least one of these is true: (1) actor.actor_user is NULL, (2) edit date - registration date < 30 days, (3) edit count now < 500, (4) there is an extendedconfirmed grant in the log later than the edit date. That would handle people like me correctly. Who would be misjudged other than those who had EC revoked? Zerotalk 04:40, 23 November 2024 (UTC)
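A sketch of that test as a case expression (untested; u, ar, ru and gt are assumed to be the user, actor_revision, revision_userindex and grant_timestamp aliases from the query above, with the user view joined on u.user_id = ar.actor_user):
case
    when ar.actor_user is null then 1 -- (1) IP / unregistered
    when timestampdiff(day, u.user_registration, ru.rev_timestamp) < 30 then 1 -- (2) edit within 30 days of registration; user_registration can be null for old accounts, which skips this condition
    when u.user_editcount < 500 then 1 -- (3) still under 500 edits now
    when gt.log_timestamp > ru.rev_timestamp then 1 -- (4) EC granted after the edit
    else 0
end as not_ec_at_time_of_rev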
- I'll have a think about it. But the first thing is that getting the registration date means joining to the user view, which may or may not significantly impact performance. I think it has almost 50 million rows, so the effect on performance can be unpredictable. Performance seems to depend in part on what mood the optimizer is in.
- I hesitate to mention this because it's yet another location on my unexplored rabbit hole map. A problem with revision counting is that it counts presence without absence. It's a live revision count. There's all the deleted revision dark matter and there is probably a lot of it in the topic area. What's gone (with article deletions etc.) is gone, or is it? Apparently not. It's still there. You can see it, or at least some of it, in the archive view. So, you can probably get deleted counts, but I've not tried to do anything with that data. It seemed very slow when I poked it (e.g. 30-40 seconds just to count my deleted revisions). There's also an approximate edit count, the edit count on user pages, from the user view, which can be a way to avoid counting revisions if precision is not so important. I don't know how that number is produced exactly because it is not live+deleted, it might be live plus deleted minus redacted I guess. Sean.hoyland (talk) 05:55, 23 November 2024 (UTC)
- + Out of interest, I had a look at the deleted revisions from the archive to see if any can still be joined to existing stuff in the topic area for the '2023-10-07' and '2024-10-06' window, and the answer is very few, 533, maybe copyvios in part at least...and it took 6 minutes. So, I guess the deleted revisions can be ignored. There is a fast way to count deleted revisions for actors using the archive_userindex view instead of archive, if that is ever needed. Sean.hoyland (talk) 10:38, 23 November 2024 (UTC)
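The fast version looks roughly like this, a sketch assuming the actor join works the same way for deleted revisions as for live ones:
select count(*) as deleted_rev_count
from archive_userindex a
join actor ar on ar.actor_id = a.ar_actor
where ar.actor_name = 'Huldra'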
- And come to think of it, that deleted revision dark matter would presumably impact the tag counts. Sean.hoyland (talk) 06:13, 23 November 2024 (UTC)
See [1]. 160 seconds isn't too bad and it counts reverts/reverted as well. Zerotalk 06:41, 23 November 2024 (UTC)
- Nice. Sean.hoyland (talk) 07:19, 23 November 2024 (UTC)
- It's interesting if you add a unique actor count column. There's double counting across the namespaces doing that, of course, but the numbers are still interesting. And if you take out the namespace column so you can see EC vs non-EC actor counts, it's always amazing to me how many there are. Sean.hoyland (talk) 15:09, 23 November 2024 (UTC)
I increased the article pool a bit more and added a distinct actors column. After Oct 7, 2023, the number of actors increased but the number of edits per actor also increased. See User:Zero0000/PIAstats#General_Statistics. Zerotalk 11:26, 24 November 2024 (UTC)
The absence of ARBECR revert tagging
I don't know how to find and count revisions in PIA that could be described as ARBECR enforcement in a vague hand-wavy way, but if the list of possible edit summary substrings looked something like the list below, it only takes a few seconds to get the revisions (along with false positives probably and missing many no doubt) for the PIA '2020-10-07' to '2024-10-06' window.
- convert(comment_revision.comment_text using utf8mb4) rlike '(arbecr|wp:ecr|non ec|not an edit request|not edit request|arbpia|notforum|500 edits|editxy|thirty days|30 days|not a serious request|edit request not done|unclear request)'
Sean.hoyland (talk) 16:58, 22 November 2024 (UTC)
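For context, a sketch of how the fragment might plug into a counting query, assuming the pia page set (the CTE from the revision-counting query above) is in scope; the rlike pattern is shortened here:
select count(*) as possible_ecr_actions
from revision_userindex ru
join comment_revision c on c.comment_id = ru.rev_comment_id
join pia on pia.page_id = ru.rev_page
where date(ru.rev_timestamp) between '2020-10-07' and '2024-10-06'
    and convert(c.comment_text using utf8mb4) rlike '(arbecr|wp:ecr|arbpia|500 edits|30 days)' -- shortened version of the pattern above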
- I think it would be ok to count a revert of a non-EC editor in mainspace as an "allowed revert" regardless of the edit summary. Zerotalk 04:46, 23 November 2024 (UTC)
Page intersection black hole
Just a note: editors who have been here long or have edited many different pages will of course have a greater interaction than "newbies", which explains why Zero and I are apparently Siamese twins, joined at the hip. To the extent these interaction numbers can have any meaning at all, they must somehow be adjusted for the number of pages the editor(s) have edited. Any thoughts about how such an adjustment could be made? Huldra (talk) 21:21, 20 November 2024 (UTC)
- I have a pile of disordered thoughts about those kinds of adjustments that have accumulated over the past months, none of which are helpful. To do it properly seems quite complicated. You can imagine that the likelihood of a page intersection between 2 actors is proportional to various numbers and inversely proportional to various others, but those numbers (and there are probably many different factors related to the page, the editors, all other editors, and the state of Wikipedia) change over time. Sean.hoyland (talk) 05:08, 21 November 2024 (UTC)
- Right. To define a statistically meaningful measure you first need a model of independent editing. I have thought about it but it is quite a tough problem. Finally, it is impossible for raw counts to distinguish between friends and enemies. To make the distinction one would need to take into account things like how often they reverted each other. Zerotalk 11:20, 21 November 2024 (UTC)
- User:Zero0000 when you say "independent editing", I think Hidden Markov Model. But that isn't really helpful, is it, when the premise is that they are not independent? Both pro-P and pro-I editors will edit the same set of articles, Huldra (talk) 23:45, 23 November 2024 (UTC)
- When they are not independent is where the potential utility is for me: isolating improbable events in the huge cloud of revisions and interactions. I can imagine a process running that watches accounts that cross the non-EC to EC boundary for some time. There appear to be surprisingly few accounts that acquire the EC privilege in a given year compared to the total number of new accounts, just a few thousand it seems. That very significantly reduces the search space for improbable page intersections that might be ban evasion related, which could be used as evidence in an SPI report or trigger an alert or something as part of an automatic active search for ban evading actors. And there does seem to be some kind of relationship between how quickly someone acquires EC and whether their account will be blocked later (or sometimes even before they get EC) for ban evasion or some other reason. Sean.hoyland (talk) 05:31, 24 November 2024 (UTC)
Tagging mysteries
Zero0000, have a look at the top 6 revisions on my talk page here. All tagged as reverted. But by what? It looks like Lowercase sigmabot III during archiving, in a revision that was tagged as a Manual revert. Maybe one or more bots are contributing to the Reverted tag counts for non-bot actors.
- 2024-11-05T17:28:19 Lowercase sigmabot III (Archiving 2 discussion(s) to User talk:Sean.hoyland/Archive 17) (Tag: Manual revert)
Sean.hoyland (talk) 08:21, 24 November 2024 (UTC)
- Interesting, yes that will increase the non-bot reverted counts for talk pages. But those counts are fairly low, so I wonder if all bot-archiving is included. Maybe it is only when an archiving action restores an exact previous version of the page. Zerotalk 10:28, 24 November 2024 (UTC)
New article out
Now we also have:
- https://nationalpost.com/opinion/opinion-wikipedia-bias-is-infecting-our-digital-ecosystem
- https://archive.is/ld98N
by Neil Seeman and Jeff Ballabon,
data on github:
They have used South Africa's genocide case against Israel as an example. I am trying to work through it. AFAIK, if an editor uses certain words (such as: genocide, apartheid, ethnic cleansing, war crime, massacre, occupation, colonialism, oppression, terrorist, regime propaganda, racist, extremist, radical, militant, conspiracy) or phrases, they are deemed to be partisan.
Among the 27 "highly biased users" are Drsmoo, Eladkarmel ....and Nableezy ;... No wonder they didn't publish who the 27 "suspects" were ...and looking through the methology: this seems like a prime example of the GIGO-principle: garbage inn, garbage out Huldra (talk) 23:58, 23 November 2024 (UTC)
- It never ceases to amaze me how people are willing to spend bulk time on projects like this that show nothing at all. If I understand the methodology, "Hamas is terrorist", "Hamas is not terrorist" and "Israel says Hamas is terrorist" all get the same score, because use of "terrorist" shows "strong ideological or political bias". This is pseudoscience at its most pathetic. Incidentally, BilledMammal got a higher "strong bias" score than Nableezy, but poor me only got a score of 0. Zerotalk 05:30, 24 November 2024 (UTC)
- Yes, that is my understanding too. I am not surprised that people are trying to find a connection; what I am surprised by is that they in effect publish that there is no connection, and then claim that this shows an "anti-Israel" bias(!) Chutzpah, I think it is called, Huldra (talk) 23:25, 24 November 2024 (UTC)