Page MenuHomePhabricator

Suggested file name on Special:Upload should not contain illegal characters
Closed, ResolvedPublic

Description

We really shouldn't suggest illegal filenames.

For example, on upload by url, putting http://upload.wikimedia.org/wikipedia/commons/e/e0/Petrorhagia_prolifera_%281%29.JPG into the url box, our js determines that Petrorhagia_prolifera_%281%29.JPG is a good destination name (which is illegal because it conflicts with percent encodings). with url's, %-encoding issues will probably be somewhat common, but could also happen for normal upload from local file.

This is a fairly low priority thing imo.


Version: 1.20.x
Severity: enhancement

Details

Reference
bz30390

Event Timeline

bzimport raised the priority of this task from to Low.Nov 21 2014, 11:53 PM
bzimport set Reference to bz30390.

karun.84 wrote:

patch that removes url encoding in file name

Hello,
I have made a patch to the javascript destination file name, to call the javascript unescape function, that will remove the URL encoding where it exists.

attachment 30390.patch ignored as obsolete

(In reply to comment #2)

Created attachment 9409 [details]
patch that removes url encoding in file name

Hello,
I have made a patch to the javascript destination file name, to call the
javascript unescape function, that will remove the URL encoding where it
exists.

Thank you for patch :). However, I think a more complicated function is needed. unescape doesn't handle unicode properly (unescape("%C4%A3") => Ä£ where it should be ģ) and the set of illegal characters in a file name don't entirely match up with the set of illegal chars in a url (Things like ][{}|#<> can't appear in titles).

Just getting rid of the percent-encoding issue would probably be a major improvement (decodeURIComponent works much better than unescape for that), but there might already be a function available for title validation in our js stuff (I'm not sure, not all that familar with the js stuff)

attachment 30390.patch ignored as obsolete

karun.84 wrote:

(In reply to comment #3)

(In reply to comment #2)

Created attachment 9409 [details]
patch that removes url encoding in file name

Hello,
I have made a patch to the javascript destination file name, to call the
javascript unescape function, that will remove the URL encoding where it
exists.

Thank you for patch :). However, I think a more complicated function is needed.
unescape doesn't handle unicode properly (unescape("%C4%A3") => ģ where it
should be ģ) and the set of illegal characters in a file name don't entirely
match up with the set of illegal chars in a url (Things like ][{}|#<> can't
appear in titles).

Just getting rid of the percent-encoding issue would probably be a major
improvement (decodeURIComponent works much better than unescape for that), but
there might already be a function available for title validation in our js
stuff (I'm not sure, not all that familar with the js stuff)

Hello,
Does decodeURIComponent handle unicode correctly? I will modify my patch to use this.

attachment 30390.patch ignored as obsolete

karun.84 wrote:

second version of patch using decodeURIcomponent

I have modified my previous patch to use the decodeURIComponent function rather than un escape.

attachment 30390-v2.patch ignored as obsolete

Hello,
Does decodeURIComponent handle unicode correctly? I will modify my patch to use
this.

Yes, it does. Perhaps something like decodeURIComponent(theSuggestedName).replace( /[\]\[\{\}\|#<>:]/g, '' );

(I suppose colon can be valid in certain configs, but they seem few - mostly that file namespace repo extension)

(although, you do have to check for exceptions with decodeURIComponent, in case of invalid utf-8 - aka decodeURIComponent("%C4") will throw an exception).

karun.84 wrote:

I don't think we should do any modification to the filename such as replace, other than to decode the URI component/%-encoding.

karun.84 wrote:

patch that removes url encoding in file name

Hello,
I have made a new version of the patch for this bug.
We decodeURIComponent to decode any URI coding in the destination file name.
Where an exception occurs as a result invalid utf-8 as described in example in the bug report, we use the destination file name, without the uri coding removed, if there is any.

No other changes are made to the destination file name, such are removing any other characters and replacing them.

attachment 30390-v3.patch ignored as obsolete

sumanah wrote:

Karun, I'm sorry for the slow response to your patch.

https://www.mediawiki.org/wiki/Git/Tutorial shows you how you can directly get your patch into our source control system to get faster response and review. Thank you again for your contribution!

karun.84 wrote:

(In reply to comment #9)

Karun, I'm sorry for the slow response to your patch.

https://www.mediawiki.org/wiki/Git/Tutorial shows you how you can directly get
your patch into our source control system to get faster response and review.
Thank you again for your contribution!

Thanks, I will have a look at it. Previously it took too long with many bugs before they were reviewed, so I gave up contributing to the project.

sumanah wrote:

Karun, thank you for committing the patch as https://gerrit.wikimedia.org/r/#/c/20532/ . And you have my regrets and apologies regarding those delays. The Wikimedia engineering community is trying to get better and faster at triaging bugs and reviewing and merging patches and I hope to see your contributions again. :)

Thanks, I will have a look at it. Previously it took too long with many bugs
before they were reviewed, so I gave up contributing to the project.

And I would like to add my apology as well, since I was actively commenting on the bug, that's really my fault.

karun.84 wrote:

(In reply to comment #12)

Thanks, I will have a look at it. Previously it took too long with many bugs
before they were reviewed, so I gave up contributing to the project.

And I would like to add my apology as well, since I was actively commenting on
the bug, that's really my fault.

Ive submitted the patch to Gerrit. It looks like Gerrit is going to help with having patches reviewed quicker.

Comment on attachment 9423
patch that removes url encoding in file name

Marking patch obsolete as it is in Gerrit now (I3a48f8a8).