Wikipedia:Bots/Requests for approval/Bot1058 8
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard. The result of the discussion was Approved.
Operator: Wbm1058 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 02:36, Saturday, June 25, 2022 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): PHP
Source code available: refreshlinks.php, refreshmainlinks.php
Function overview: Purge pages with recursive link update in order to refresh links which are old
Links to relevant discussions (where appropriate):
- User talk:wbm1058/Continuing null editing
- Wikipedia talk:Bot policy/Archive 29#Regarding WP:BOTPERF
- Periodically run refreshLinks.php on production sites. (T157670; Open, Low priority)
- Force pages to be fully re-parsed occasionally (T135964; Open, Needs Triage)
Edit period(s): Continuous
Estimated number of pages affected: ALL
Exclusion compliant (Yes/No): No
Already has a bot flag (Yes/No): Yes
Function details: This task runs two scripts to refresh English Wikipedia page links. refreshmainlinks.php null-edits mainspace pages whose page_links_updated database field is older than ~32 days, and refreshlinks.php null-edits all other namespaces whose page_links_updated database field is older than ~80 days. The 32- and 80-day figures are approximate and are dynamically adjusted after each run (the original filing said only that they may be tweaked as needed to ensure more timely refreshing of links or reduce load on the servers). If fewer than 250 pages were purged by the current run then the number of days ago to purge is decremented by one, e.g. reduced from 32 to 31. If the full limit of 10,000 pages was purged then the number of days ago to purge is incremented by one, e.g. increased from 80 to 81. Each script is configured to purge a maximum of 10,000 pages on a single run (originally 150,000), and restarts after sleeping for two minutes (originally every three hours if not currently running, so that each script could run up to 8 times per day). The tasks run as continuous jobs on Toolforge Kubernetes, as my PHP code runs continuous loops. The only "babysitting" needed is to periodically rename the log files and then restart the Toolforge jobs, in order to limit the size of the log files.
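The adjustment rule described in the function details can be sketched as follows. This is a minimal illustration under stated assumptions, not the bot's actual source: the function name adjustDaysAgo, the two constants, and the floor of one day are all hypothetical.

```php
<?php
// Hypothetical sketch of the threshold adjustment; names are illustrative only.
const MIN_PURGED = 250;    // fewer pages than this: look one day less far back
const RUN_LIMIT  = 10000;  // per-run cap; hitting it means the run fell behind

function adjustDaysAgo(int $daysAgo, int $pagesPurged): int
{
    if ($pagesPurged < MIN_PURGED) {
        // Little work was found, so tighten the window (floor of 1 is an assumption).
        return max(1, $daysAgo - 1);
    }
    if ($pagesPurged >= RUN_LIMIT) {
        // The cap was reached, so relax the window by one day.
        return $daysAgo + 1;
    }
    return $daysAgo; // anything in between leaves the window unchanged
}

echo adjustDaysAgo(32, 120), "\n";    // 31
echo adjustDaysAgo(80, 10000), "\n";  // 81
echo adjustDaysAgo(70, 4000), "\n";   // 70
```

The effect of a rule like this is that each cutoff drifts toward the smallest number of days the process can sustain without repeatedly hitting the per-run cap.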
Status may be monitored by these Quarry queries:
and this Toolforge systems query:
Discussion
I expect speedy approval, as a technical request, as this task only makes null edits. Task has been running for over a month. My main reason for filing this is to post my source code and document the process including links to the various discussions about it. – wbm1058 (talk) 03:02, 25 June 2022 (UTC)[reply]
- Comment: This is a very useful bot that works around long-standing feature requests that should have been straightforward for the MW developers to implement. It makes sure that things like tracking categories and transclusion counts are up to date, which helps gnomes fix errors. – Jonesey95 (talk) 13:30, 25 June 2022 (UTC)[reply]
- Comment: My main concerns are related to the edit filter; I'm not sure whether that looks at null edits or not. If it does, it's theoretically possible that we might suddenly be spammed by a very large number of filter log entries, if and when a filter gets added that widely matches null edits (and if null edits do get checked by the edit filter, we would want the account making them to have a high edit count and to be autoconfirmed, because for performance reasons, many filters skip users with high edit counts).
To get some idea of the rate of null edits: the robot's maximum editing speed is 14 edits per second (150000 × 8 in a day). There are 6,923,997 articles, 62,026,743 pages total (how did we end up with almost ten times as many pages as articles?); this means that the average number of edits that need making per day is around 825000 per day, or around 9.5 per second. Wikipedia currently gets around 160000 edits per day (defined as "things that have an oldid number", so including moves, page creations, etc.), or around 2 per second. So this bot could be editing four times as fast as everyone else on Wikipedia put together (including all the other bots), which would likely be breaking new ground from the point of view of server load (although the servers might well be able to handle it anyway, and if not I guess the developers would just block its IP from making requests) – maybe a bit less, but surely a large proportion of pages rarely get edited.
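The per-second figures quoted in this comment are per-day counts divided by 86,400 seconds. A short sketch of that conversion, using only the per-day numbers stated in the comment (the helper name perSecond is hypothetical):

```php
<?php
// Convert the per-day figures quoted in the comment into per-second rates.
const SECONDS_PER_DAY = 86400;

function perSecond(float $perDay): float
{
    return $perDay / SECONDS_PER_DAY;
}

$maxPerDay = 150000 * 8; // 150,000 pages per run, up to 8 runs per day
printf("Bot maximum:        %d/day = %.1f/s\n", $maxPerDay, perSecond($maxPerDay)); // 13.9/s
printf("Needed on average:  %.1f/s\n", perSecond(825000));                          // 9.5/s
printf("All other edits:    %.1f/s\n", perSecond(160000));                          // 1.9/s
```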
As a precaution, the bot should also avoid null-editing pages that contain {{subst: (possibly with added whitespace or comments), because null edits can change the page content sometimes in this case (feel free to null-edit User:ais523/Sandbox to see for yourself – just clicking "edit" and "save" is enough); it's very hard to get the wikitext to subst a template into a page in the first place (because it has a tendency to replace itself with the template's contents), but once you manage it, it can lie there ready to trigger and mess up null edits, and this seems like the sort of thing that might potentially happen by mistake (e.g. Module:Unsubst is playing around in similar space; although that one won't have a bad interaction with the bot, it's quite possible we'll end up creating a similar template in future and that one will cause problems). --ais523 23:06, 6 July 2022 (UTC)
- While this task does not increase the bot's edit count, it has performed 7 other tasks and has an edit count of over 180,000, which should qualify as "high". wbm1058 (talk) 03:38, 8 July 2022 (UTC)
- There are far more users than articles; I believe User talk: is the largest namespace and thus the most resource-intensive to purge (albeit perhaps with a smaller average page size). wbm1058 (talk) 03:38, 8 July 2022 (UTC)[reply]
- The term "null edit" is used here for convenience and simplification; technically the bot purges the page cache and forces a recursive link update. This is about equivalent to a null edit, but I'm not sure that it's functionally exactly the same. – wbm1058 (talk) 03:38, 8 July 2022 (UTC)[reply]
- Ah; this seems to be a significant difference. A "purge with recursive link update" on my sandbox page doesn't add a new revision, even though a null edit does. Based on this, I suspect that purging pages is lighter on the server load than an actual null edit would be, and also recommend that you use "purge with recursive link update" rather than "null edit" terminology when describing the bot. --ais523 08:32, 8 July 2022 (UTC)[reply]
- Yes and just doing a recursive link update would be even lighter on the server load. The only reason my bot forces a purge is that there is currently no option in the API for only updating links. See this Phabricator discussion. – wbm1058 (talk) 12:42, 8 July 2022 (UTC)[reply]
- As I started work on this project March 13, 2022 and the oldest page_links_updated date (except for the Super Six) is April 28, 2022, I believe that every page in the database older than 72 days has now been null-edited at least once, and I've yet to see any reports of problems with unintended substitution. wbm1058 (talk) 03:38, 8 July 2022 (UTC)[reply]
- This is probably a consequence of the difference between purges and null edits; as long as you stick to purges it should be safe from the point of view of unintended substitution. --ais523 08:32, 8 July 2022 (UTC)[reply]
- To make this process more efficient the bot bundles requests into groups of 20; each request sent to the server is for 20 pages to be purged at once. wbm1058 (talk) 03:38, 8 July 2022 (UTC)[reply]
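The batching described above can be sketched like this. It is an illustration only: the function name buildPurgeRequests and the example titles are hypothetical, and the bot may well pass page IDs or use its framework's own request helper. The parameter names (action=purge, titles, forcerecursivelinkupdate) are those of the MediaWiki Action API.

```php
<?php
// Hypothetical sketch: split a list of titles into purge requests of 20 pages each.
function buildPurgeRequests(array $titles, int $batchSize = 20): array
{
    $requests = [];
    foreach (array_chunk($titles, $batchSize) as $batch) {
        $requests[] = [
            'action' => 'purge',
            'format' => 'json',
            'forcerecursivelinkupdate' => 1,      // also refresh pages that transclude these
            'titles' => implode('|', $batch),     // the API separates titles with "|"
        ];
    }
    return $requests;
}

// Example with 45 made-up titles: two full batches and one batch of 5.
$titles = [];
for ($i = 1; $i <= 45; $i++) {
    $titles[] = "Example page $i";
}
$requests = buildPurgeRequests($titles);
echo count($requests), " requests\n"; // 3 requests
echo http_build_query($requests[2]), "\n"; // POST body for the last, partial batch
```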
- Comment: I've worked the refreshlinks.php cutoff from 80 down to 70 days; the process may be able to hold it there. I've been trying to smooth out the load so that roughly the same number of pages are purged and link-refreshed each day. – wbm1058 (talk) 11:49, 8 July 2022 (UTC)[reply]
- Note. This process is dependent on my computer maintaining a connection with a Toolforge bastion. Occasionally my computer becomes disconnected for unknown reasons, and when I notice this I must manually log back in to the bastion. If my computer becomes disconnected from the bastion for an extended time, this process may fall behind the expected page_links_updated dates. – wbm1058 (talk) 11:55, 12 July 2022 (UTC)
- Another note. The purpose/objective of this task is to keep the pagelinks, categorylinks, and imagelinks tables reasonably updated. Regenerating these tables using the rebuildall.php maintenance script is not practical for English Wikipedia due to its huge size. Even just running the refreshLinks.php component of rebuildall is not practical due to the database size (it may be practical for smaller wikis). The goal of phab:T159512 (Add option to refreshLinks.php to only update pages that haven't been updated since a timestamp) is to make it practical to run refreshLinks.php on English Wikipedia. My two scripts find the pages that haven't been updated since a timestamp, and then purge these pages with recursive link updates. Recursive link updates are what refreshLinks.php does. – wbm1058 (talk) 14:42, 16 July 2022 (UTC)
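Finding "pages that haven't been updated since a timestamp" amounts to comparing page_links_updated against a cutoff. The sketch below shows one way such a cutoff and query could be built; it is not taken from refreshlinks.php. The function names are hypothetical, and the query assumes the standard MediaWiki page table with 14-digit UTC timestamps.

```php
<?php
// Hypothetical sketch: compute a cutoff and build a query for stale pages.
function cutoffTimestamp(int $daysAgo, int $now): string
{
    // MediaWiki stores timestamps as YYYYMMDDHHMMSS in UTC.
    return gmdate('YmdHis', $now - $daysAgo * 86400);
}

function buildStalePagesQuery(bool $mainspace, int $limit): string
{
    $nsTest = $mainspace ? 'page_namespace = 0' : 'page_namespace <> 0';
    // "?" is a placeholder for the cutoff, to be bound by the database layer.
    // Rows whose page_links_updated is NULL are not matched by this comparison
    // and would need separate handling if they should be included.
    return "SELECT page_id, page_namespace, page_title FROM page"
         . " WHERE $nsTest AND page_links_updated < ?"
         . " ORDER BY page_links_updated LIMIT $limit";
}

$now = gmmktime(0, 0, 0, 7, 16, 2022);   // fixed date so the example is reproducible
echo cutoffTimestamp(32, $now), "\n";     // 20220614000000
echo buildStalePagesQuery(true, 10000), "\n";
```

Ordering by page_links_updated means the oldest links are refreshed first when the per-run limit is reached.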
- Approved for trial (30 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's see if anything breaks. Primefac (talk) 16:24, 6 August 2022 (UTC)[reply]
- @Primefac: This task just purges the page cache and forces recursive link updates, so there are no relevant contributions and/or diffs for me to provide a link to. But I see that text is coming from the {{BotTrial}} template, so you probably didn't intend to make that request. As to "anything breaking", the bot went down sometime after I left on wikibreak, and now that I'm back it's catching up. In other words, the task as currently configured "breaks" easily and requires a lot of minding to keep it running. Perhaps it would be more reliable if I figured out how to set it up as a tool running from my Toolforge admin console. – wbm1058 (talk) 15:11, 25 August 2022 (UTC)[reply]
- To improve reliability, I suggest running the task on the toolforge grid. When running on the grid, the server running your code and the database are on the same high-speed network. You appear to have tunnelled the toolforge database to local port 4711. This setup is only intended for development-time debugging and will be unreliable for long-running tasks, as you have discovered. Also, I suggest using a significantly lower limit than 150000 – that is a very large number of titles to expect from a single database call, and could cause timeouts and/or put too much pressure on the database. Instead process just 5-10k titles at a time, and run the script more frequently. – SD0001 (talk) 19:18, 29 August 2022 (UTC)
- @SD0001 and Primefac: I set up https://toolsadmin.wikimedia.org/tools/id/refreshlinks; now I'm trying to figure out what to do with it. Apparently Grid is legacy and deprecated, and Jobs framework and Kubernetes are preferred for new bot setups. But before I automate this task on Toolforge I need to set it up there so I can manually run it. Per the Toolforge quickstart guide (which is anything but quick for helping me get started) I created my tool's code/html root directory:
mkdir public_html
but I don't need to create my bot's code, I just need to copy it to that directory. One of the files needed to run my bot is the file containing login passwords and I'm leery of copying that to a directory with "public" in its name! Some guidance on how to do this would be appreciated since the quickstart authors apparently felt that wasn't necessary. Microsoft Notepad probably isn't installed on the Toolforge and I probably need Linux rather than Microsoft commands. Can I import the files from wikipages (i.e. User:Bot1058/refreshlinks.php)? wbm1058 (talk) 19:09, 31 August 2022 (UTC)
- @Wbm1058. All files in the tool directory (not just public_html) are public by default. Passwords, OAuth secrets and the like can be made private by using chmod: chmod 600 file-with-password.txt.
Since you're creating a bot and not a webservice, the files shouldn't go into public_html. They can be in any directory. See wikitech:Help:Toolforge/Grid for submitting jobs to the grid. (The grid is legacy, yes, but the newer k8s-based Jobs framework is not that mature and can be harder to work with, especially for people not familiar with containers.)
To copy over files from a Windows system, IMO the best tool is WinSCP (see wikitech:Help:Access to Toolforge instances with PuTTY and WinSCP). It's also possible to edit files directly on toolforge, such as by using nano. – SD0001 (talk) 20:39, 31 August 2022 (UTC)
- I finally got around to installing WinSCP. That was easy since it uses PuTTY and I just told it to use my configuration that I previously installed for PuTTY. I couldn't find any of the three "Advanced Site Settings" screens; it appears those were in a previous version of WinSCP but are not in the current version 5.21.3. Not sure I really need them since the setup seems to all have been automatically imported from PuTTY. I think "Advanced Site Settings" was renamed to "Preferences". Under "Preferences"→"Environment" I see "Interface, Window, Commander, Explorer, Languages" rather than "Directories, Recycle bin, Encryption, SFTP, Shell".
Now I see I created the directory /mnt/nfs/labstore-secondary-tools-project/refreshlinks for my first "tool",
and the sub-directory /mnt/nfs/labstore-secondary-tools-project/refreshlinks/public_html (my tool's code/html root directory).
I also have a personal directory /mnt/nfs/labstore-secondary-tools-home/wbm1058 which has just one file: replica.my.cnf (my database access credentials),
and when I try to look at other users' personal directories I get "Permission denied" errors, so I assume that any PHP code I put in my personal directory would be private so only I could read it. My tool also has a replica.my.cnf file which I can't read with WinSCP when logged into my personal account. But if in PuTTY I "become refreshlinks" then I can read my tool's replica.my.cnf file and see that it has different credentials than my personal replica.my.cnf file.
All my bots use the botclasses framework (User:RMCD bot/botclasses.php). Should I create another tool named "botclasses" for my framework, to avoid the need to make separate copies for each individual tool that uses it? I see at wikitech:Portal:Toolforge/Tool Accounts#Manage files in Toolforge that I may need to "take ownership" of files or "mount" them. §Sharing files via NFS (what is NFS?) says "Shared config or other files may be placed in the /data/project/shared directory, which is readable (and potentially writeable) by all Toolforge tools and users." Still trying to digest this information. – wbm1058 (talk) 17:41, 15 September 2022 (UTC)
- Answering my own question: NFS = Network File System, a distributed file system protocol originally developed by Sun Microsystems in 1984. – wbm1058 (talk) 19:10, 6 October 2022 (UTC)
- Yes, personal user directories are private. replica.my.cnf files are different for each user and tool and have the mode -r--------, which means only the owner can read and no one can modify. The recommendation to use different tool accounts per "tool" is for webservices (since each tool account can have only one web domain). For bots, just use a single tool account for multiple bots – that's easier to maintain and manage. – SD0001 (talk) 05:53, 18 September 2022 (UTC)
- Thanks. Then I'd like to rename refreshlinks to a more generic name that covers all my bots, but tools can't be renamed, nor can maintainers delete Tool Accounts. I will follow the steps described at Toolforge (Tools to be deleted). It should be obvious from my experience trying to get a "quick start" on Toolforge why you have such a growing list of tools that have been volunteered for deleting by their maintainers. – wbm1058 (talk) 18:11, 22 September 2022 (UTC)
- @SD0001: I set up https://toolsadmin.wikimedia.org/tools/id/billsbots and then in PuTTY I "become billsbots" and mkdir php, creating a PHP directory where I can upload needed files from the PHP directory on my Windows PC. Then I go over to WinSCP to try to upload the files. There I can upload botclasses.php into the /billsbots/ root directory but I don't have permission to upload to the /billsbots/php/ sub-directory I just created. I see "tools.billsbots" is the owner of the /billsbots/php/ sub-directory but wbm1058 is owner of botclasses.php. I logged into WinSCP the same way I log into PuTTY, as wbm1058. Is there a way inside WinSCP to "become billsbots" analogous to the way I do that in PuTTY? I assume "tools.billsbots" should be the owner of its public PHP files and not "wbm1058"? Also unsure of what rights settings the php directory and the files in that directory that don't house passwords should have. Right now they are just the defaults from mkdir php and the upload. – wbm1058 (talk) 18:52, 24 September 2022 (UTC)
- There's no need to become the tool in WinSCP – group permissions can be used instead of owner permissions. The group tools.billsbots includes the user wbm1058. The problem in this case is that the group doesn't have write permission. See wikitech:Help:Access_to_Toolforge_instances_with_PuTTY_and_WinSCP#Troubleshooting_permissions_errors. Files which don't have passwords typically should have 774 (owner+group can do everything, public can read) perms. – SD0001 (talk) 05:38, 25 September 2022 (UTC)
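The octal modes mentioned in this thread (600 for a password file, 774 for ordinary files, and the -r-------- shown for replica.my.cnf) can be decoded mechanically. A small sketch follows; the helper name modeToString is hypothetical, and it covers only the nine permission bits, not the file-type character or special bits.

```php
<?php
// Hypothetical helper: render the nine permission bits of an octal mode as rwx text.
function modeToString(int $mode): string
{
    $out = '';
    // Owner, group, other: three bits each, most significant group first.
    for ($shift = 6; $shift >= 0; $shift -= 3) {
        $bits = ($mode >> $shift) & 7;
        $out .= ($bits & 4) ? 'r' : '-';
        $out .= ($bits & 2) ? 'w' : '-';
        $out .= ($bits & 1) ? 'x' : '-';
    }
    return $out;
}

foreach ([0600, 0660, 0774, 0400] as $mode) {
    printf("%o  %s\n", $mode, modeToString($mode));
}
// 600  rw-------   owner read/write only (password files)
// 660  rw-rw----   owner and group read/write
// 774  rwxrwxr--   owner and group everything, others read
// 400  r--------   owner read only
```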
@SD0001: Thank you so much for your help. I've now successfully manually run refreshlinks.php from the command prompt in PuTTY. I need to be logged in as myself for it to work, and not as my tool, because I own and have read permission for my password file, and my tool does not. Per wikitech:Help:Toolforge/Grid#Submitting simple one-off jobs using 'jsub', when I become my tool and then run
jsub -N refreshlinks php /mnt/nfs/labstore-secondary-tools-project/billsbots/php/refreshlinks.php
I get this in my refreshlinks.out file:
Warning: include(/mnt/nfs/labstore-secondary-tools-project/billsbots/php/logininfo.php): failed to open stream: Permission denied in /mnt/nfs/labstore-secondary-tools-project/billsbots/php/refreshlinks.php on line 28
– wbm1058 (talk) 15:32, 1 October 2022 (UTC)
- @Wbm1058 become the tool, take the file (transfers ownership to tool) and then do chmod 660 – that would give access to both yourself and the tool. – SD0001 (talk) 18:20, 1 October 2022 (UTC)
- @SD0001 and Primefac: I just got an email notice for Phabricator T319590: Migrate billsbots from Toolforge GridEngine to Toolforge Kubernetes. Damn, I haven't even gotten anything running on an automated basis yet, just a few one-time runs as I try to familiarize myself with how the GridEngine works, and already I have a bureaucratic nag! I knew going into this that establishing my bots on Toolforge would not be easy, and my expectations have been exceeded! Maybe I just need to bite the bullet and learn how to use the "not that mature" and possibly "harder to work with" Jobs framework, and familiarize myself with containers. – wbm1058 (talk) 16:35, 6 October 2022 (UTC)
- @Wbm1058 Looks like that was part of mass-creation of tickets so nothing to urgently worry about (they've covered A to D only so my tool hasn't come up yet!). If they're becoming pushy about this, I suppose the Jobs framework is mature now, though there are quite a few things it doesn't support.
It should be easy enough to migrate: instead of putting a jsub command in crontab for scheduling, use the toolforge-jobs command, passing --image as tf-php74. – SD0001 (talk) 17:53, 6 October 2022 (UTC)
- Just noticed now that I got an email on October 9 which I overlooked at first because I didn't recognize the sender.
- sftp-server killed by Wheel of Misfortune on tools bastion
- From Root <root@tools.wmflabs.org>
Your process `sftp-server` has been killed on tools-sgebastion-10 by the Wheel of Misfortune script.
You are receiving this email because you are listed as the shell user running the killed process or as a maintainer of the tool that was.
Long-running processes and services are intended to be run on the either the Kubernetes environment or the job grid not on the bastion servers themselves. In order to ensure that login servers don't get heavily burdened by such processes, this script selects long-running processes at random for destruction.
See <https://phabricator.wikimedia.org/T266300> for more information on this initative. You are invited to provide constructive feedback about the importance of particular types long running processes to your work in support of the Wikimedia movement.
For further support, visit #wikimedia-cloud on libera.chat or <https://wikitech.wikimedia.org>
- I guess that explains why "the task as currently configured "breaks" easily and requires a lot of minding to keep it running." Thanks, I guess, for this belated message that came only 3½ months after I got my automated process running this way. So I suppose speedy approval isn't merited and won't be forthcoming. I did not know that I was running a process named sftp-server. What is that, and what is it doing? Most of this bot's process is still running on my own PC. Every few hours when a new script-run starts, it logs into the replica database and does a query which, even when it returns 150K results, takes only a couple of minutes. Then it logs out. It's not like this is constantly hitting on bastion resources. The only reason I need to be logged into the bastion 24×7 (via PuTTY) is that, if I'm not, then my bot, when it starts, will not be able to "tunnel" and thus will fail. The vast majority of the time I'm logged into the bastion, I'm just sitting there idle, doing nothing. Not "heavily burdening" the login server. I need to "tunnel" because there is no MediaWiki API for the database query I need to make. Otherwise I don't need the Toolforge because there is an API for making the "null edit" purges. – wbm1058 (talk) 15:53, 14 October 2022 (UTC)
- I think the Wheel of Misfortune sftp-server kills are from my open WinSCP session. I didn't get WinSCP installed and running until September 15, and the first email I saw from the Wheel of Misfortune was sent on October 9 (and I've received several since then). I keep WinSCP open on my desktop for my convenience. I just saw there is a "Disconnect Session" option on the "Session" tab in WinSCP and I just clicked on it. Hopefully that will stop the Wheel of Misfortune's anger. Now I can just click "Reconnect Session" when I go back to use WinSCP again – which saves me the trouble of needing to close and reopen the entire app. As far as I know the Wheel of Misfortune has never actually shut down my bot itself, perhaps because individual bot runs are not sufficiently long-running processes to draw the attention of the "Wheel". Even runs that purge 150,000 pages run in a matter of hours, not days. – wbm1058 (talk) 17:21, 19 December 2022 (UTC)
- Perhaps helpful to see how other bots running on Toolforge are configured to find a template for how to set mine up. – wbm1058 (talk) 22:45, 14 October 2022 (UTC)[reply]
- Here's how I set my PHP bots up: User:Novem Linguae/Essays/Toolforge bot tutorial#Running at regular intervals (cronjob, kubernetes, grid). I found kubernetes to have a heavy learning curve, but I suppose getting the code off your local computer and onto Toolforge is the "proper" way to do things. Another method might be setting up a webserver on Toolforge/kubernetes that is an API for the query you need to make. Hope this helps. –Novem Linguae (talk) 08:35, 15 October 2022 (UTC)[reply]
- Being connected to the bastion 24x7 is a no-no. Ideally, the bot process should run on toolforge itself so that no connection is needed at all between your local system and toolforge. If you really want to run the bot on local system, the tunnel connection to the database should be made only when required, and closed immediately after. Creating temporary new connections is cheap, leaving them open indefinitely is not. – SD0001 (talk) 16:51, 16 October 2022 (UTC)[reply]
- I've got my first Kubernetes one-off job running now, to refresh 40,000 pages. Commands I used to get it started:
wbm1058@tools-sgebastion-10:~$ become billsbots
tools.billsbots@tools-sgebastion-10:~$ toolforge-jobs run refreshlinks-k8s --command "php ./php/refreshlinks.php" --image tf-php74 --wait
ERROR: timed out 300 seconds waiting for job 'refreshlinks-k8s' to complete:
+------------+-----------------------------------------------------------------+
| Job name: | refreshlinks-k8s |
+------------+-----------------------------------------------------------------+
| Command: | php ./php/refreshlinks.php |
+------------+-----------------------------------------------------------------+
| Job type: | normal |
+------------+-----------------------------------------------------------------+
| Image: | tf-php74 |
+------------+-----------------------------------------------------------------+
| File log: | yes |
+------------+-----------------------------------------------------------------+
| Emails: | none |
+------------+-----------------------------------------------------------------+
| Resources: | default |
+------------+-----------------------------------------------------------------+
| Status: | Running |
+------------+-----------------------------------------------------------------+
| Hints: | Last run at 2022-11-03T16:53:38Z. Pod in 'Running' phase. State |
| | 'running'. Started at '2022-11-03T16:53:40Z'. |
+------------+-----------------------------------------------------------------+
tools.billsbots@tools-sgebastion-10:~$ toolforge-jobs list
Job name: Job type: Status:
---------------- ----------- ---------
refreshlinks-k8s normal Running
tools.billsbots@tools-sgebastion-10:~$
Will wait a bit for new emails or Phabricators to come in telling me what I'm still doing wrong, before proceeding to the next step, creating scheduled jobs (cron jobs). – wbm1058 (talk) 19:12, 3 November 2022 (UTC)[reply]
- One thing I'm apparently still doing wrong is "Login to Wikipedia as Bot1058 from a device you have not recently used". That's the title of an email I get every time I run a one-off job on Toolforge. The message says "Someone (probably you) recently logged in to your account from a new device. If this was you, then you can disregard this message. If it wasn't you, then it's recommended that you change your password, and check your account activity." The Help button at the bottom of the email message links to mw:Help:Login notifications, which says "this feature relies on cookies to keep track of the devices you have used to log in". I'm guessing that cookies are not working in my Toolforge account.
The code I use to log in is:
    $objwiki = new wikipedia();
    $objwiki->login($user, $pass);

    /**
     * This function takes a username and password and logs you into wikipedia.
     * @param $user Username to login as.
     * @param $pass Password that corresponds to the username.
     * @return array
     **/
    function login ($user,$pass) {
        $post = array('lgname' => $user, 'lgpassword' => $pass);
        $ret = $this->query('?action=query&meta=tokens&type=login&format=json');
        print_r($ret);
        /* This is now required - see https://bugzilla.wikimedia.org/show_bug.cgi?id=23076 */
        $post['lgtoken'] = $ret['query']['tokens']['logintoken'];
        $ret = $this->query( '?action=login&format=json', $post );
        if ($ret['login']['result'] != 'Success') {
            echo "Login error: \n";
            print_r($ret);
            die();
        } else {
            print_r($ret);
            return $ret;
        }
    }
- These emails will get very annoying pretty fast if I get this task set up to run frequent, small jobs rather than infrequent, large jobs – as @SD0001: suggests. Help please! wbm1058 (talk) 13:52, 4 November 2022 (UTC)[reply]
- The login code looks ok to me. Not sure why the emails didn't stop coming after the first few times, but if necessary you can disable them from Special:Preferences notifications tab. My general tip for botops is to use OAuth, which avoids this and several other problems. – SD0001 (talk) 19:11, 4 November 2022 (UTC)[reply]
- I found a relevant Phabricator task and added my issue there. – wbm1058 (talk) 13:08, 6 November 2022 (UTC)[reply]
- I think I solved this. Per comments in the Phab, as my bot only logged in and didn't make any edits, the IP(s) weren't recorded in the CheckUser table and every log in was treated as being from a "new" IP. To work around this, I did some one-off runs of another task this bot has which does actually make edits. After running that bot task a few times on the Toolforge, the emails stopped coming, even for the task that just refreshes links and doesn't make any edits.
- But in the meantime before I figured that out, I searched for OAuth "quick start" links, and am posting my finds here:
- What the Heck is OAuth?
- OAuth
- https://oauth.net/
- wikitech:OAuth
- mw:OAuth (disambiguation page)
- mw:Extension:OAuth – the OAuth extension implements an OAuth server in MediaWiki that supports both the OAuth 1.0a and OAuth 2.0 protocol versions.
- mw:OAuth/For Developers
- mw:OAuth/Owner-only consumers
- mw:Help:OAuth
- mw:Core Platform Team/Initiatives/OAuth2
- meta:Special:OAuthConsumerRegistration/propose
- At some point while navigating this forest of links, my mind exploded. I'm putting OAuth on my back burner now, to focus on creating scheduled jobs. Meanwhile I have these links saved here so I may come back to this at some point. – wbm1058 (talk) 15:46, 10 November 2022 (UTC)[reply]
Job logs
On my way to creating scheduled jobs, I ran into another issue. Per wikitech:Help:Toolforge/Jobs framework#Job logs, "Subsequent same-name job runs will append to the same files... there is no automatic way to prune log files, so tool users must take care of such files growing too large."
What?! How hard can it be to offer a "supersede" option to override the default "append"? – wbm1058 (talk) 22:07, 12 November 2022 (UTC)[reply]
- I've raised this issue in T301901. – wbm1058 (talk) 09:59, 13 November 2022 (UTC)[reply]
- A "supersede" option sounds like a bad idea as that would mean you can only ever see the logs of the latest job run. – SD0001 (talk) 14:00, 25 January 2023 (UTC)[reply]
- @SD0001: I get your point, but wikitech:Help:Toolforge/Jobs framework#Job logs says "Log generation can be disabled with the --no-filelog parameter when creating a new job". If it makes sense to sometimes disable logs entirely, why wouldn't it also make sense to sometimes supersede them? All logs for bots running on my desktop PC are always superseded. That's usually not a problem, but sometimes it would be nice to be able to go back and look at a previous log to see what happened on the run where a bug first surfaced. The logs for this task are quite long though.
- I've successfully started running this bot's tasks 3, 4, and 5 as a scheduled hourly task on the jobs framework, see User:Bot1058#Tasks. The logs for those tasks are usually pretty short though, so it does make sense to append there. – wbm1058 (talk) 16:10, 27 January 2023 (UTC)[reply]
Abandoned complicated workaround after T301901 closed
|
---|
@SD0001: I'm trying to implement the somewhat complicated workaround given at wikitech:Help:Toolforge/Jobs framework#Custom log files. I've added some explanations to this section (see the edit history) so let me know if I added anything that's not correct. I take the following as instructions to type the following directly from my PuTTY keyboard.
tools.mytool@tools-sgebastion-11:~$ cat > log-wrapper.sh <<EOF
> #!/bin/sh
> jobname=$1
> command=$2
> mkdir -p logs
> sh -c $command 1>>logs/${jobname}.log 2>>logs/${jobname}.log
> EOF
tools.mytool@tools-sgebastion-11:~$ chmod a+x log-wrapper.sh
After doing that I notice that the $1 and $2, and $command and ${jobname}, were eaten somehow. The contents of my log-wrapper.sh file are:
#!/bin/sh
jobname=
command=
mkdir -p logs
sh -c 1>>logs/.log 2>>logs/.log
which doesn't seem right to me. Of course I can just copy-paste the contents of the file from the Help: page directly with WinSCP, rather than type them in with PuTTY (which I did). If this Help: page isn't giving instructions that work, it should be corrected. I've made a couple of unsuccessful attempts, and something was obviously wrong with my syntax:
./php/refreshlinks.php: 1: cannot open ?php: No such file
./php/refreshlinks.php: 2: /bin: Permission denied
./php/refreshlinks.php: 3: log-wrapper.sh: not found
./php/refreshlinks.php: 4: log-wrapper.sh: not found
./php/refreshlinks.php: 5: Syntax error: word unexpected (expecting ")")
– wbm1058 (talk) 19:06, 17 November 2022 (UTC)[reply]
|
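The variables described above disappear because an unquoted heredoc delimiter (<<EOF) lets the interactive shell expand $1, $2, $command and ${jobname} before the text is written to the file. A minimal sketch of one way around that, assuming a POSIX sh: the quoted delimiter ('EOF') and the added double quotes around $command and the log path are assumptions of this sketch, not the wording of the Help: page.

```shell
#!/bin/sh
# Quoting the delimiter ('EOF') makes the heredoc body literal, so $1, $2 and
# ${jobname} reach log-wrapper.sh unexpanded.
cat > log-wrapper.sh <<'EOF'
#!/bin/sh
jobname=$1
command=$2
mkdir -p logs
# Assumption of this sketch: "$command" is quoted so the whole command string
# is passed to sh -c as a single argument.
sh -c "$command" 1>>"logs/${jobname}.log" 2>>"logs/${jobname}.log"
EOF
chmod a+x log-wrapper.sh
```

Under these assumptions the wrapper would be called as ./log-wrapper.sh refreshlinks "php ./php/refreshlinks.php", which appends both output streams to logs/refreshlinks.log.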
- Kubernetes' beta phase has been declared done and the new phab:T327254 "next steps in grid engine deprecation" has opened. But Job logs still says "there is no automatic way to prune log files, so tool users must take care of such files growing too large". Huh? I guess I made the mistake of trying to piggyback on an existing Phab rather than opening a new one. – wbm1058 (talk) 13:29, 25 January 2023 (UTC)[reply]
- @Wbm1058 For now, I would suggest not worrying about pruning log files. It would take a long time before the logs grow big enough to be of any concern, at which time you could just delete or truncate it manually. – SD0001 (talk) 14:01, 25 January 2023 (UTC)[reply]
- OK, now I'm tracking T327165 and hoping that will provide me with a solution. – wbm1058 (talk) 17:43, 18 August 2023 (UTC)[reply]
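Truncating a log manually, as suggested above, could look like the following sketch. The helper name trim_log and the line count are hypothetical. It keeps only the last lines of a log and rewrites the file in place rather than replacing it. Anything a running job writes during the trim may be lost, so this assumes the job is idle or stopped.

```shell
#!/bin/sh
# trim_log FILE MAX_LINES (hypothetical helper): keep only the last MAX_LINES lines.
trim_log() {
    file=$1
    max_lines=$2
    tmp="${file}.tmp.$$"
    tail -n "$max_lines" "$file" > "$tmp" || return 1
    # Overwrite in place (cat > file) instead of mv, so the same file is kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example usage (log path as used by the job command in this discussion):
# trim_log ./logs/refreshlinks.log 5000
```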
OK, I think I'm close to having this wired. Per the advice above to "process just 5-10k titles at a time, and run the script more frequently", I've set the LIMIT for database lookups to 10000 and am now running the refreshlinks script as a continuous job. If the number of pages processed in the previous run is less than 250, then it sleeps 20 minutes before hitting the database again; otherwise it just sleeps for two minutes. Command I used to get it started:
toolforge-jobs run refreshlinks --command "php ./php/refreshlinks.php" --image php8.2 -o ./logs/refreshlinks.log -e ./logs/refreshlinks.log --continuous
I don't get the impression it runs any faster on the Toolforge than it runs on my 12-yr old desktop; if anything it seems to be running a little slower on Toolforge.
If this setup is OK then I'll set up my other script refreshmainlinks to run this way too. At the moment that one is still running on my desktop as it has been since I filed this BRFA. – wbm1058 (talk) 21:49, 9 February 2023 (UTC)[reply]
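The sleep rule described above (20 minutes after a run of fewer than 250 pages, otherwise two minutes) can be sketched as follows. This is a shell illustration only, not the bot's PHP code; the variable and function names are hypothetical.

```shell
#!/bin/sh
# Thresholds taken from the description above; the names are hypothetical.
SMALL_RUN=250      # fewer pages than this counts as a small run
LONG_SLEEP=1200    # 20 minutes, in seconds
SHORT_SLEEP=120    # 2 minutes, in seconds

# pick_sleep PAGES_PROCESSED: print how many seconds to sleep before the next run.
pick_sleep() {
    if [ "$1" -lt "$SMALL_RUN" ]; then
        echo "$LONG_SLEEP"
    else
        echo "$SHORT_SLEEP"
    fi
}

pick_sleep 10000   # prints 120
pick_sleep 37      # prints 1200
```

A continuous job would then call something like sleep "$(pick_sleep "$pages")" between database lookups, where $pages is the count from the run that just finished.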
Tracking the run times for processing 10,000 pages (extracted from the job logs):
- 1:22:00
- 1:44:50
- 1:05:38
- 1:31:03
- 1:09:35
- 1:31:15
- 1:38:56
- 1:41:52
- 1:40:20
- 1:38:02
- 1:36:09
- 1:24:08
- 1:23:27
- 1:24:41
- 1:23:19
- 1:26:14
- 1:12:31
- 1:17:15
- 1:18:33
- 1:08:25
- 1:09:03
- 1:07:40
- 1:11:57
- 1:09:44
- 1:07:25
- 1:20:23
- 1:13:55
- 1:12:13
- 1:11:36
- 1:08:31
- 1:02:30
- 1:12:53
It doesn't seem like I've gained anything from having "the server running your code and the database are on the same hi-speed network". I haven't looked into how my process on Toolforge may be resource-limited and how to request more resources. I haven't really noticed any reliability issues running on my desktop, at least not as much as I had last August. – wbm1058 (talk) 13:50, 10 February 2023 (UTC)[reply]
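To compare run times like those listed above across setups, a mean is easier to read than the raw list. A minimal sketch, assuming the times are available one per line in h:mm:ss form (without the leading "- " of the list); the function name mean_hms is hypothetical, and only the first three listed values are used here.

```shell
#!/bin/sh
# mean_hms (hypothetical helper): read h:mm:ss durations on stdin, one per line,
# and print their mean, rounded down to whole seconds, as h:mm:ss.
mean_hms() {
    awk -F: '
        { total += $1 * 3600 + $2 * 60 + $3; n++ }
        END {
            if (n == 0) exit 1
            mean = int(total / n)
            printf "%d:%02d:%02d\n", int(mean / 3600), int((mean % 3600) / 60), mean % 60
        }'
}

# First three run times from the list above.
printf '%s\n' 1:22:00 1:44:50 1:05:38 | mean_hms   # prints 1:24:09
```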
My bot has found a new problem, which I've reported at WP:VPT#new fat: project conflicting with a couple English wiki article titles. – wbm1058 (talk) 12:37, 30 July 2023 (UTC)[reply]
After many months running refreshlinks on the Toolforge and refreshmainlinks on my PC, I've shut down all link-refreshing processes that were running on my PC and have started running refreshmainlinks on the Toolforge with the following command:
toolforge-jobs run refreshmainlinks --command "php ./php/refreshmainlinks.php" --image php8.2 -o ./logs/refreshmainlinks.log -e ./logs/refreshmainlinks.log --continuous
Now all (both) link-refreshing processes are running on the Toolforge. – wbm1058 (talk) 19:13, 14 August 2023 (UTC)[reply]
I've updated the Function details to reflect how the current versions 3.2 of my PHP code work. – wbm1058 (talk) 12:41, 16 August 2023 (UTC)[reply]
Approved. This bot task performs no logged edits or actions and appears to have been running fine for quite a while now. – SD0001 (talk) 17:56, 26 August 2023 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at Wikipedia:Bots/Noticeboard.