r58712 MediaWiki - Code Review archive

Repository:MediaWiki
Revision:r58711‎ | r58712 | r58713 >
Date:15:13, 7 November 2009
Author:daniel
Status:deferred (Comments)
Tags:
Comment:
allow RDFa attributes; missing support for <a>, will be added as a parser tag hook
Modified paths:
  • /trunk/phase3/includes/Sanitizer.php (modified) (history)

Diff [purge]

Index: trunk/phase3/includes/Sanitizer.php
@@ -626,6 +626,16 @@
627627 $wgEnforceHtmlIds ? 'noninitial' : 'xml' );
628628 }
629629
 630+ //RDFa properties allow URIs. check them
 631+ if ( $attribute === 'rel' || $attribute === 'rev' ||
 632+ $attribute === 'about' || $attribute === 'property' || $attribute === 'resource' ||
 633+ $attribute === 'datatype' || $attribute === 'typeof' ) {
 634+ //Paranoia. Allow "simple" values but suppress javascript
 635+ if ( preg_match( '/(^|\s)javascript\s*:/i', $value ) ) {
 636+ continue;
 637+ }
 638+ }
 639+
630640 // If this attribute was previously set, override it.
631641 // Output should only have one attribute of each name.
632642 $out[$attribute] = $value;
@@ -1154,7 +1164,11 @@
11551165 * @return Array
11561166 */
11571167 static function setupAttributeWhitelist() {
1158 - $common = array( 'id', 'class', 'lang', 'dir', 'title', 'style' );
 1168+ $common = array( 'id', 'class', 'lang', 'dir', 'title', 'style',
 1169+ #RDFa attributes as specified in section 9 of http://www.w3.org/TR/2008/REC-rdfa-syntax-20081014
 1170+ 'about', 'property', 'resource', 'datatype', 'typeof',
 1171+ );
 1172+
11591173 $block = array_merge( $common, array( 'align' ) );
11601174 $tablealign = array( 'align', 'char', 'charoff', 'valign' );
11611175 $tablecell = array( 'abbr',

Comments

#Comment by MaxSem (talk | contribs)   15:31, 7 November 2009

Possible attack vectors: javascript/**/:alert(document.cookie) vbscript:alert(document.cookie)

#Comment by Duesentrieb (talk | contribs)   15:46, 7 November 2009

better pattern in r58714

#Comment by Simetrical (talk | contribs)   00:45, 8 November 2009

Some issues here:

Validity in XHTML1/HTML5
RDFa is neither valid XHTML 1.0 Transitional, nor valid HTML5. If we want to continue conforming to a particular standard, we need to change which standard we're conforming to, at least. We might want to make this a switch like $wgHtml5. Is there a validator for XHTML+RDFa, or HTML5+RDFa?
Validity as RDFa
As far as I can tell, this change doesn't allow conformant RDFa+XHTML or RDFa+HTML5 documents to be created. Most of the attributes you've whitelisted don't accept URLs, they only accept CURIEs (see <http://www.w3.org/TR/rdfa-syntax/#col_Metainformation>). However, to declare a CURIE, AFAICT, you need to use a property prefixed with xmlns: <http://www.w3.org/TR/rdfa-syntax/#s_curieprocessing>, which is not allowed. So it looks like this doesn't actually permit legitimate RDFa in wikisyntax. That seems to defeat the point.
RDFa vs. microdata
RDFa is very controversial right now within the HTML Working Group at the W3C. The editor of HTML5, along with the representatives of every major browser vendor that's commented on the issue (AFAIK), thinks RDFa is much too complicated and hard to use. HTML5 contains an alternative format for encoding semantic data, microdata: <http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#microdata>. RDFa is older and therefore has more established support, but IIRC, browser vendors are pushing microdata strongly, and (for example) plan to support it in future browser releases, while they deliberately won't support RDFa. Of course, a lot of people are committed to RDFa already, so microdata is controversial too.

So we should maybe think about whether we want to support RDFa, microdata, both, or neither. As far as I can tell, anything that you'd realistically want to do with RDFa can be done more easily with microdata, and it's fairly trivial to convert from one to the other (if you have a consumer – RDFa consumers might be easier to get right now). To allow microdata, we would want to whitelist the following attributes: itemid, itemprop, itemref, itemscope, itemtype.

It's not at all clear which of RDFa and microdata will end up winning. If possible, I think it would be best if we avoided supporting one over the other, at least by default, until the dust has settled a bit more. We don't want to get stuck with lots of content in an obsolete format. What's the justification for adding RDFa support in the first place? Would it be better suited to an off-by-default option?

Summary of suggested changes (if this is kept at all):

  • Drop the check for javascript:/vbscript:, it's both unnecessary and insufficient if it were necessary.
  • Permit every attribute that begins with "xmlns:".
  • Double-check that this works with HTML5 output, see: <http://dev.w3.org/html5/rdfa/rdfa-module.html>
  • Disable RDFa by default. If a particular community expresses interest in enabling it, we can consider enabling it for them. IMO we don't want to take sides on this format dispute right now unless there's a very good reason. (And if we do it should be on the side of microdata.  ;) )
#Comment by Duesentrieb (talk | contribs)   10:20, 8 November 2009

i think rdfa vs. microdata is not something we should decide on the technology level, we should aim to allow both; basically, allow people to generate the (html) output they want, as long as it can be done safely. The current use case is to embed rdfa as specified by creative commons to represent license metadata, especially for images, so it gets picked up by google, the cc-crawler, etc.

About the identifiers: as far as I understood, it's possible to use full URIs as well as CURIEs as values, but i'll check again. If URIs are possible, I think it's prudent to filter out the most obviously nasty values.



#Comment by Simetrical (talk | contribs)   14:56, 8 November 2009

Allowing both RDFa and Microdata seems undesirable from a complexity standpoint. However, if we do need one (for license metadata or such), then allowing both would seem safest for now. I think it would be best in this case to at least allow them to be opt-out by config options, though, if not opt-in. Opt-in would probably be bad if they're used in Wikimedia content; that should Just Work.

Regarding the identifiers: I don't know anything about RDFa, but people who know more than I do pointed me to those parts of the spec. As far as I can tell from <http://www.w3.org/TR/rdfa-syntax/#col_Metainformation>, datatype, typeof, property, rel, and rev accept only CURIE(s) (or for rel/rev, also reserved words). URIs are only accepted for about or resource. For the others, http://foo.com/ would be interpreted as a CURIE with prefix part "http" and reference part "//foo.com/", which won't work because "http" is not a defined prefix.

Finally, for "unsafe" URIs, I don't think it's a good idea to have feel-good measures that make it look like we're more secure when we actually aren't. It might give people a false sense of security, and/or make us look silly. There should be no problem with allowing javascript: or vbscript:, and if there is, we'd need much stricter checking to achieve any semblance of security.

#Comment by Duesentrieb (talk | contribs)   20:55, 13 November 2009

adressed several concerns in r59024, more info there

#Comment by Simetrical (talk | contribs)   23:42, 10 December 2009

Still some unaddressed concerns here, setting from deferred back to fixme. (deferred is only for stuff that doesn't affect Wikimedia, I thought?)

#Comment by Simetrical (talk | contribs)   00:10, 11 December 2009

List of subsequent RDFa-related commits: r58717, r59024, r59027, r59032, r59033, r59035, possibly others.

#Comment by Tim Starling (talk | contribs)   02:18, 11 January 2010

My comments on this project are at r58717.

#Comment by Tim Starling (talk | contribs)   06:40, 9 February 2010

It looks pretty harmless now that $wgAllowRdfaAttributes is off by default. Can we mark these commits as deferred for now?

#Comment by Simetrical (talk | contribs)   11:24, 9 February 2010

Fine by me.

Status & tagging log