Page MenuHomePhabricator

TwoColConflict selenium tests run in parallel trigger db errors
Closed, ResolvedPublic

Description

The Two-Column-Edit-Conflict-Merge selenium tests can be run in parallel, as the setup was adjusted to avoid conflicts between the specs when run concurrently. This works fine, when I test it locally with my docker environment.

On the CI the current setup triggers DB errors though. See for example in here

Error: internal_api_error_DBQueryError: [a855e8f84ae77e552fdcff6b] Exception caught: A database query error has occurred. This may indicate a bug in the software.
09:20:28     at /workspace/src/extensions/TwoColConflict/node_modules/mwbot/src/index.js:262:31
09:20:28     at tryCatcher (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/util.js:16:23)
09:20:28     at Promise._settlePromiseFromHandler (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/promise.js:547:31)
09:20:28     at Promise._settlePromise (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/promise.js:604:18)
09:20:28     at Promise._settlePromise0 (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/promise.js:649:10)
09:20:28     at Promise._settlePromises (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/promise.js:729:18)
09:20:28     at _drainQueueStep (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/async.js:93:12)
09:20:28     at _drainQueue (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/async.js:86:9)
09:20:28     at Async._drainQueues (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/async.js:102:5)
09:20:28     at Immediate.Async.drainQueues [as _onImmediate] (/workspace/src/extensions/TwoColConflict/node_modules/bluebird/js/release/async.js:15:14)
09:20:28     at processImmediate (internal/timers.js:464:21)

including the following DB error log

09:22:26 mw-dberror.log:Mon Jul 3 7:20:27 UTC 2023	10f8e54cc31d	wikidb	Error 1213 from User::addToDatabase, Deadlock found when trying to get lock; try restarting transaction INSERT IGNORE INTO `user` (user_name,user_password,user_newpassword,user_email,user_email_authenticated,user_real_name,user_token,user_registration,user_editcount,user_touched) VALUES ('User-0.9376071091678215-Iñtërnâtiônàlizætiøn','','','',NULL,'','abd32aac229246c70e706a2b599f5f87','20230703072027',0,'20230703072027') localhost:/workspace/db/quibble-mysql-jh9lekb4/socket
09:22:26 mw-dberror.log:#0 /workspace/src/includes/libs/rdbms/database/Database.php(1258): Wikimedia\Rdbms\Database->getQueryExceptionAndLog(string, integer, string, string)
09:22:26 mw-dberror.log:#1 /workspace/src/includes/libs/rdbms/database/Database.php(743): Wikimedia\Rdbms\Database->reportQueryError(string, integer, string, string, boolean)
09:22:26 mw-dberror.log:#2 /workspace/src/includes/libs/rdbms/database/Database.php(1564): Wikimedia\Rdbms\Database->query(Wikimedia\Rdbms\Query, string)
09:22:26 mw-dberror.log:#3 /workspace/src/includes/user/User.php(2769): Wikimedia\Rdbms\Database->insert(string, array, string, array)
09:22:26 mw-dberror.log:#4 /workspace/src/includes/libs/rdbms/database/Database.php(2444): User->{closure}(Wikimedia\Rdbms\DatabaseMysqli, string)
09:22:26 mw-dberror.log:#5 /workspace/src/includes/libs/rdbms/database/DBConnRef.php(119): Wikimedia\Rdbms\Database->doAtomicSection(string, Closure)
09:22:26 mw-dberror.log:#6 /workspace/src/includes/libs/rdbms/database/DBConnRef.php(658): Wikimedia\Rdbms\DBConnRef->__call(string, array)
09:22:26 mw-dberror.log:#7 /workspace/src/includes/user/User.php(2798): Wikimedia\Rdbms\DBConnRef->doAtomicSection(string, Closure)
09:22:26 mw-dberror.log:#8 /workspace/src/includes/auth/AuthManager.php(1567): User->addToDatabase()
09:22:26 mw-dberror.log:#9 /workspace/src/includes/auth/AuthManager.php(1290): MediaWiki\Auth\AuthManager->continueAccountCreation(array)
09:22:26 mw-dberror.log:#10 /workspace/src/includes/api/ApiAMCreateAccount.php(108): MediaWiki\Auth\AuthManager->beginAccountCreation(User, array, string)
09:22:26 mw-dberror.log:#11 /workspace/src/includes/api/ApiMain.php(1915): ApiAMCreateAccount->execute()
09:22:26 mw-dberror.log:#12 /workspace/src/includes/api/ApiMain.php(892): ApiMain->executeAction()
09:22:26 mw-dberror.log:#13 /workspace/src/includes/api/ApiMain.php(863): ApiMain->executeActionWithErrorHandling()
09:22:26 mw-dberror.log:#14 /workspace/src/api.php(95): ApiMain->execute()
09:22:26 mw-dberror.log:#15 /workspace/src/api.php(48): wfApiMain()
09:22:26 mw-dberror.log:#16 {main}

Each spec starts with a couple of API calls to create user accounts and prepare and edit test pages. So when run in parallel these obviously add up. Still, I'm curious what could be done to avoid these errors, since it's just fine on my local setup. Might also be relevant for other tests that need a lot of setup at spec start.

Event Timeline

That is indeed T199393. At the time Anomie wrote:

You can replace the "LOCK IN SHARE MODE" with "FOR UPDATE" without changing the behavior.

He has a repro case and suspected a lock to happen on the gap between existing users. He got the deadlock report (using innodb_print_all_deadlocks) at T199393#4420174 and I further analyzed it a few comments below at T199393#7606990. The workaround is to have the test framework to retry the user creation :/

I suggest marking this one as a dupe of T199393.

The workaround is to have the test framework to retry the user creation :/

Fine by me as a workaround. On the test's side I could probably catch the exception to avoid further hazel. I would have one remaining issue still I guess. You can see it in the example given: We have a mechanism to re-run tests if they fail once. So at the end the selenium tests are "fine" due to the re-run, but the error logs are not empty so the whole job still fails. :-/

09:22:26 + ERROR_FILES=(mw-dberror.log mw-error.log)
09:22:26 + echo 'Asserting empty files: mw-dberror.log mw-error.log'

I'm not sure if ignoring that step for selenium tests is the right way forward ^^'. I guess I'll either leave the concurrency for now, or I would work on a setup that avoids account creation. Also not highest priority on our end TBH.

zeljkofilipin subscribed.

Removing MediaWiki-Core-Tests since this task is about an extension, not core. Please add the tag back if I misunderstood it.

Mostly, that is an extension test failing and that requires a workaround for an issue which is in MediaWiki core (T199393)

The workaround is to have the test framework to retry the user creation :/

Fine by me as a workaround. On the test's side I could probably catch the exception to avoid further hazel. I would have one remaining issue still I guess. You can see it in the example given: We have a mechanism to re-run tests if they fail once. So at the end the selenium tests are "fine" due to the re-run, but the error logs are not empty so the whole job still fails. :-/

09:22:26 + ERROR_FILES=(mw-dberror.log mw-error.log)
09:22:26 + echo 'Asserting empty files: mw-dberror.log mw-error.log'

I'm not sure if ignoring that step for selenium tests is the right way forward ^^'. I guess I'll either leave the concurrency for now, or I would work on a setup that avoids account creation. Also not highest priority on our end TBH.

I don't see any solution to flush out wrongful error logs when retrying a test short of having log files to vary based on the test name (maybe by passing the test as a header which can be injected to the global variables used to forge the filenames) :-( And I guess retrying the user creation would still leave a trace in mw-dberror.log :/

Meanwhile I have filed T342088 to have Quibble to capture MariaDB error log and enable the verbose deadlock report. I have made it a subtask of the mediawiki/core issue.

@WMDE-Fisch thank you for having taken the time to recheck and verify the issue got resolved! And congratulations on making TwoColConflict Selenium tests to now be running in parallel