PDF file has 0x0 image size in Commons after uploading a new version while the page number is correct
Open, Needs Triage | Public | BUG REPORT

Description

After uploading a new version of this file to Commons:
https://commons.wikimedia.org/wiki/File:PL_Hudson_Jej_naga_stopa.pdf
it has a 0x0 image size. (The initial version reports a non-zero size at the moment.)

Note: 0x0 size is a SYMPTOM that can have many causes (something failed while reading the uploaded file). This ticket specifically deals with a recurring case on WIKIMEDIA, where (not consistently) after upload the size becomes 0x0.

Reverting to the previous version fixed the problem for only a few minutes; then its image size was zeroed as well, and it reappeared after an hour.

No problem with this file after upload to test Wikipedia:
https://test.wikipedia.org/wiki/File:PL_Hudson_Jej_naga_stopa-test.pdf

Note: this does not seem to be a duplicate of T297942 as unlike
https://commons.wikimedia.org/wiki/File:Guinault_-_Sergent_!_(1881).pdf
this file has correct page numbers.

Event Timeline

Page number is properly set, so this problem seems to be different to T298417

Ankry renamed this task from PDF file has 0x0 image size in Commons after uploading a new version to PDF file has 0x0 image size in Commons after uploading a new version while page number is corect.Jan 19 2022, 2:49 PM
Ankry renamed this task from PDF file has 0x0 image size in Commons after uploading a new version while page number is corect to PDF file has 0x0 image size in Commons after uploading a new version while the page number is correct.
Ankry reopened this task as Open.
Ankry updated the task description. (Show Details)

Finally, I tried deleting all versions except the last one, but it didn't work. So I deleted everything and reuploaded the file, and it shows fine. The bug still exists, though. And it doesn't show properly on Wikisource.

I'm getting a similar problem with this PDF:

(Currently it is broken only on la.ws, as I reverted to an old version.)

An administrator can try to upload this file locally. This sometimes works as a workaround, but hides the problem.

<snip>
An administrator can try to upload this file locally. This sometimes works as a workaround, but hides the problem.

Are you able to give me some instructions or tips on how this is done so our administrators can assess if it is easy to try or not?

Download the file from Commons. Visit Special:Upload on your wiki and upload the file, telling it to continue anyway (I think it will complain that it exists on Commons).

Just as a general comment... While it's helpful if you report these issues, if you revert these uploads several times, while only linking to the general File: page, it makes it hard for anyone to try and debug the issue, as they can't necessarily find which is the "broken" version of the PDF...

@Reedy Yes I see that. I'm afraid I was trying to brute force solve this problem, but all of the attempts failed. I'll remember to link to file versions in future.

As it goes the current version is still broken on LA WS, and I won't be moving it again as I am hoping an administrator will move a copy onto the wiki as per your instructions.

I went through a similar issue while overwriting this PDF file: https://commons.wikimedia.org/wiki/File:%E8%A8%93%E8%92%99%E5%AD%97%E6%9C%83.pdf
It works fine at Commons, but is displayed inconsistently at other wikis, for example Wikipedia-en and Wikisource-ko.

I've stepped through the logic a bit with some of the reported files, and pretty much the only way this can happen is if pdfinfo is not defined, not executable, or incorrect, or if $wgPdfHandlerDpi is 0 or undefined.

Both would be silent errors when they occur.

Another option is that the file that boxedcommand creates with the metadata output is not available to the MediaWiki app server for some reason. This too would be a silent error.

D6283's comment indicates that this is still happening. If all pdfinfo work is now a boxedcommand and running in kubernetes, then this indicates that some hosts either don't have pdfinfo, or that there are occasional issues with the output of the command. A race condition with the file flush?
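To illustrate why both conditions end in a 0x0 size without any visible error, here is a minimal Python sketch (not PdfHandler's actual PHP code). It assumes the pixel size is derived from the "Page size" line of pdfinfo output as points times DPI divided by 72; the function names find_pdfinfo and pixel_size are hypothetical. The sketch raises on each failure instead of silently returning zeros.

```python
import re
import shutil

POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def find_pdfinfo(binary="pdfinfo"):
    """Return the full path of pdfinfo, or raise instead of failing silently."""
    path = shutil.which(binary)
    if path is None:
        raise FileNotFoundError(f"{binary} is not installed or not executable")
    return path


def pixel_size(pdfinfo_output, dpi):
    """Convert the 'Page size' line of pdfinfo output into pixel dimensions."""
    # A DPI of 0 (or None) would multiply every dimension down to 0x0.
    if not dpi or dpi <= 0:
        raise ValueError(f"DPI must be positive, got {dpi!r}")
    match = re.search(
        r"^Page size:\s+([\d.]+) x ([\d.]+) pts", pdfinfo_output, re.MULTILINE
    )
    # Empty or truncated output has no size line to read.
    if match is None:
        raise ValueError("no 'Page size' line in pdfinfo output")
    width_pt, height_pt = (float(v) for v in match.groups())
    return (
        round(width_pt * dpi / POINTS_PER_INCH),
        round(height_pt * dpi / POINTS_PER_INCH),
    )
```

With checks like these, a missing binary, empty output, or a zero DPI would each surface as a distinct error rather than as the same 0x0 symptom.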

This generally happens on large PDF files. Maybe there is a memory/time/other limit that pdfinfo execution exceeds?

@Ankry That would be something that points more to the second possibility. A race condition with the file not yet being available to the process trying to read the metadata.

The file gets uploaded, written to the main file server, then the metadata reading starts and the file is not there at all, or not yet completely written to the location where the metadata is trying to read that file from (a replica server).
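If the race described above is the cause, one generic mitigation is to confirm the file is completely visible before reading metadata from it. The Python sketch below is only an illustration of that idea, not something MediaWiki is known to do; wait_until_complete and its parameters are hypothetical, and it assumes the expected byte size is known from the upload.

```python
import os
import time


def wait_until_complete(path, expected_size, attempts=5, delay=0.5):
    """Poll until `path` exists with the expected byte size.

    Returns True once the file is complete, False if it never is.
    """
    for attempt in range(attempts):
        try:
            if os.path.getsize(path) == expected_size:
                return True
        except OSError:
            pass  # file is not visible on this host yet
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller would read the metadata only when this returns True, and log a failure otherwise, so that a partially replicated file is never parsed as if it were whole.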

Change 1011371 had a related patch set uploaded (by TheDJ; author: TheDJ):

[mediawiki/extensions/PdfHandler@master] Improve logging for Pdf's retrieveMetadata.sh

https://gerrit.wikimedia.org/r/1011371

The patch should make any errors more verbose so that we can collect more information about these failures.

Change 1011371 merged by jenkins-bot:

[mediawiki/extensions/PdfHandler@master] Improve logging for Pdf's retrieveMetadata.sh

https://gerrit.wikimedia.org/r/1011371

&action=purge seems to solve this issue. Can we search for other PDF files that have this issue?

Can confirm:

  1. that the problem is present both at Commons and every local wiki
  2. ?action=purge on a local description page (e.g. Spanish Wikisource Archivo:Filename.pdf) solves the problem locally, but only after '?action=purge'ing the Commons description page.
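The purge order reported above (Commons description page first, then the local one) can be sketched as a small Python helper that only builds the URLs. It assumes the standard /w/index.php script path used on Wikimedia wikis; purge_url and purge_sequence are hypothetical names, and opening such a URL in a browser may show a confirmation prompt before the purge runs.

```python
from urllib.parse import urlencode


def purge_url(site, title):
    """Build the index.php URL that purges one page on one wiki."""
    query = urlencode({"title": title, "action": "purge"})
    return f"https://{site}/w/index.php?{query}"


def purge_sequence(filename, local_site, local_namespace="File"):
    """Return purge URLs in the reported order: Commons first, then local."""
    return [
        purge_url("commons.wikimedia.org", f"File:{filename}"),
        purge_url(local_site, f"{local_namespace}:{filename}"),
    ]
```

For example, purge_sequence("Filename.pdf", "es.wikisource.org", "Archivo") yields the Commons URL followed by the Spanish Wikisource one.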

@TheDJ Based on buzz around various places this problem seems to be markedly more prevalent currently, and has been for anything from around the last week to a couple of weeks (guesstimate based on how long it usually takes such issues to bubble up to catch attention).

We're mostly seeing this for Index:-namespace pages, possibly because 1) they are nearly always depending on multi-page files (vs. plain jpegs), 2) they actually always need the page numbers and other extended metadata, and 3) they are very often created shortly after the file is uploaded.

Anecdotally (and I verified in one case), the file on Commons looks fine, the file on the local wiki looks broken (no thumb), and the Index: page displays the above error. Purging the file on Commons has no effect, but purging the non-existent local file description page fixes it. This last is different from previous problems that looked similar in that no amount of purging of anything seemed to have an effect in those cases, except maybe exceptionally and randomly (so we're probably talking a cluster of similar problems).

But that it appears significantly more prevalent right now means something changed somewhere in the stack and combined with the extended logging it may be possible to pinpoint the cause.

I don't know if this adds anything to the discussion, but on my MediaWiki installation (MediaWiki 1.40.0; PHP 8.3.7 (apache2handler); ICU 70.1; MariaDB 10.11.8) I am having the same issue. If I try to generate the thumbs of a PDF via script, I get this:
Deprecated: strlen(): Passing null to parameter #1 ($string) of type string is deprecated in /var/www/html/mediawiki/thumb.php on line 362

Passing null to strlen() was deprecated in PHP 8.1, but this is probably just a symptom. Whatever is being passed to strlen() should presumably not be null and is so because something failed earlier in the process.

This "something" could be anything, including datacenters being out of sync, JobQueue jobs timing out, DB queries timing out, pathological data that makes Ghostscript spit out something the PDF handler doesn't handle.

For example, it looks like FileRepo\ThumbnailEntryPoint.php mostly sanitizes what it passes to strlen() (using null coalescing), but in one instance, when trying to generate a Content-Length header, it passes $content directly so that if it fails to get the thumbnail data it'll throw that error.
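The difference between coalescing a missing value and passing it straight through can be shown with a Python analogue (MediaWiki itself is PHP, so this is only an illustration with hypothetical function names). The first function mirrors the null-coalescing style and hides the failure; the second makes the earlier failure explicit.

```python
def content_length_lenient(content):
    """Treat missing data as empty, like PHP's `$content ?? ''`."""
    return f"Content-Length: {len(content or b'')}"


def content_length_strict(content):
    """Refuse to build the header when the thumbnail data is missing."""
    if content is None:
        raise RuntimeError("thumbnail data is missing; an earlier step failed")
    return f"Content-Length: {len(content)}"
```

The lenient form avoids the deprecation notice but reports a zero-length body, while the strict form points at the real problem: whatever failed to produce the data.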

Well in this case it is https://github.com/wikimedia/mediawiki/blob/1.40.0/thumb.php#L362
So thumbproxyurl is null. This was fixed with null coalescing in a patch release of 1.40