User Details
- User Since: Oct 31 2014, 3:23 AM (525 w, 1 d)
- Roles: Disabled
- LDAP User: GWicke
- MediaWiki User: Unknown
Oct 16 2017
I don't think we are clear on the data requirements yet. However, the data-mw separation has been delayed repeatedly, so this hasn't been an issue yet.
Oct 14 2017
It might be worth focusing more on robustness than simple-page latency, as that is the more critical issue with Electron. Previously, I tested with a few very large articles (see T142226#2537844). This tested timeout enforcement. Testing with a simulated overload (many concurrent requests for huge pages) could also be useful to ensure that concurrency limits and resource usage limits are thoroughly enforced.
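For illustration, an overload test along those lines could be as simple as the sketch below; the endpoint URL, page titles, and concurrency level are placeholders, not the real service:

```typescript
// Hypothetical overload-test sketch: fire many concurrent requests for huge
// pages and check that the service sheds load or times out cleanly rather
// than hanging. Endpoint, titles, and concurrency are made-up placeholders.
const RENDER_URL = 'https://render.example.org/pdf/';
const HUGE_PAGES = ['Barack_Obama', 'List_of_compositions_by_Johann_Sebastian_Bach'];
const CONCURRENCY = 50;

async function timedRequest(title: string): Promise<void> {
  const start = Date.now();
  try {
    const res = await fetch(RENDER_URL + encodeURIComponent(title));
    console.log(`${title}: HTTP ${res.status} after ${Date.now() - start}ms`);
  } catch (err) {
    console.log(`${title}: failed after ${Date.now() - start}ms (${err})`);
  }
}

async function main(): Promise<void> {
  const requests: Promise<void>[] = [];
  for (let i = 0; i < CONCURRENCY; i++) {
    requests.push(timedRequest(HUGE_PAGES[i % HUGE_PAGES.length]));
  }
  await Promise.all(requests);
}

main();
```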
Oct 11 2017
There is now a commercial product offering ServiceWorker execution at the CDN level: https://blog.cloudflare.com/introducing-cloudflare-workers/
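For context, the programming model is the familiar ServiceWorker fetch handler, just executed at the edge. A minimal sketch (the added header and rewrite logic are purely illustrative, and the snippet assumes service-worker typings):

```typescript
// Minimal ServiceWorker-style edge handler sketch. Assumes service-worker
// typings for FetchEvent; the header added here is purely illustrative.
addEventListener('fetch', (event: FetchEvent) => {
  event.respondWith(handle(event.request));
});

async function handle(request: Request): Promise<Response> {
  // Fetch from the origin, then decorate the response before it reaches the
  // client; the kind of per-request logic that could run at the CDN edge.
  const upstream = await fetch(request);
  const response = new Response(upstream.body, upstream);
  response.headers.set('x-served-by', 'edge-worker');
  return response;
}
```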
This isn't going to happen any more.
If-unmodified-since support is implemented in https://github.com/wikimedia/budgeteer.
Basic implementation: https://github.com/wikimedia/budgeteer.
Oct 2 2017
See T172815 for our general thinking on robust PDF rendering based on the experience with OCG and Electron. It boils down to using a fresh render process per request & thoroughly controlling its resource consumption and maximum runtime.
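As a rough sketch of the per-request process pattern (the render-pdf wrapper and its options are hypothetical; in production you would additionally want OS-level memory and CPU limits):

```typescript
import { execFile } from 'child_process';

// Sketch of "one fresh render process per request" with a hard wall-clock
// timeout and a bounded output size. The render-pdf wrapper is hypothetical.
function renderOnce(url: string, timeoutMs = 60_000): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const child = execFile(
      'render-pdf',                      // hypothetical renderer wrapper
      [url],
      {
        timeout: timeoutMs,              // kill the child on timeout
        maxBuffer: 100 * 1024 * 1024,    // cap output at 100 MB
        encoding: 'buffer',
      },
      (err, stdout) => (err ? reject(err) : resolve(stdout as Buffer))
    );
    // Nothing is reused across requests: the process exits after each render,
    // so leaked memory or hung renderer state cannot affect later requests.
    child.on('exit', (code, signal) =>
      console.debug(`renderer exited: code=${code} signal=${signal}`));
  });
}
```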
Electron hangs have little to do with high rates, and very much to do with specific requests for very large / complex pages, and a single backend worker being used for many consecutive requests.
@Pchelolo, agreed that the race is not critical. It is essentially just the normal delay in propagating new restrictions, and should be short in any case.
Sep 29 2017
Queues caused many of the issues with OCG. I would really advise you to stick to a simple stateless HTTP service. Such a service offers sane error handling, provides back-pressure, integrates well with caching and rate / concurrency limiting infrastructure, and is easy to test and reason about. Once you add a queue & separate request from response, you lose all of this.
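To make the back-pressure point concrete, here is a minimal sketch of a stateless handler that rejects excess load outright instead of queueing it (the limit and the render stub are illustrative):

```typescript
import * as http from 'http';

// Sketch of back-pressure in a stateless HTTP service: when the in-flight
// request count hits a limit, reject immediately with 503 instead of queueing.
const MAX_IN_FLIGHT = 10;
let inFlight = 0;

async function doRender(url: string): Promise<Buffer> {
  return Buffer.from(`rendered ${url}`); // stand-in for the real render call
}

http.createServer(async (req, res) => {
  if (inFlight >= MAX_IN_FLIGHT) {
    res.writeHead(503, { 'retry-after': '5' });
    res.end('Service overloaded, please retry later.\n');
    return;
  }
  inFlight++;
  try {
    const pdf = await doRender(req.url ?? '/');
    res.writeHead(200, { 'content-type': 'application/pdf' });
    res.end(pdf);
  } catch (err) {
    res.writeHead(500);
    res.end(String(err));
  } finally {
    inFlight--;
  }
}).listen(8080);
```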
Sep 27 2017
This metric would perfectly complement the REST equivalent in https://grafana.wikimedia.org/dashboard/db/api-summary?orgId=1, and as a result give us direct information on overall API use.
Sep 26 2017
In today's team sync meeting, we briefly touched on the possibility of combining the migration to the restriction table with the move to Cassandra 3. I think combining the two is attractive, as it lets us leverage the parallel double write / read testing we are doing anyway to test the new restriction storage as well. Doing this migration also lets us drop the revision table, in favor of action API requests for the few direct requests to the /title/{title} endpoint (T158100).
Sep 25 2017
All wiktionaries combined are only about 20m pages of typically moderate complexity and relatively low access volume. I think in this case it might be fine to just nuke all wiktionary content.
Sep 22 2017
@Tgr added this on a related mail thread:
Sep 20 2017
If I recall correctly, ResourceLoader client code on desktop already looks at the list of modules needed on a given page, checks client-side caches, fetches the remaining modules from the RL API (in a single call), and caches those modules separately in localStorage. Given that this discussion is making no reference to this, I am getting the impression that this understanding might be wrong. Could you clarify?
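For illustration, the flow I am describing would look roughly like the sketch below; this is not the actual ResourceLoader code, and the load.php parameters, cache keys, and eval-based execution are simplified assumptions:

```typescript
// Illustrative module-loading flow: check a localStorage cache per module,
// fetch the missing ones in a single batched request, cache, then execute.
// The load.php URL shape and cache key format are simplified assumptions.
async function loadModules(names: string[]): Promise<void> {
  const missing = names.filter(n => localStorage.getItem('rl:' + n) === null);
  if (missing.length > 0) {
    // One request for all missing modules, using '|' as the separator.
    const url = '/w/load.php?modules=' + encodeURIComponent(missing.join('|'));
    const code = await (await fetch(url)).text();
    // In reality the response would be split per module before caching;
    // here we just store the combined blob under each name.
    for (const n of missing) {
      localStorage.setItem('rl:' + n, code);
    }
  }
  for (const n of names) {
    // eslint-disable-next-line no-eval
    eval(localStorage.getItem('rl:' + n) as string); // execute the module code
  }
}
```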
@Fjalapeno, that comment touches on 1), but as I said, to me it looks like the API-focused discussion has moved to 2). Either way, I am not sure we need a new API for either 1) or 2).
FTR, this is the graph with the alert I mentioned: https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen&orgId=1
Sep 19 2017
The currently deployed kartotherian uses tough-cookie v2.3.2, but [also pulls in v2.2.2 as a dependency of tilelive-vector](https://github.com/wikimedia/maps-kartotherian-deploy/blob/master/node_modules/tilelive-vector/node_modules/tough-cookie/package.json#L103).
At today's team sync we agreed with @Pchelolo's proposal:
I honestly don't have a strong preference between the other "hearted" tasks. Given that all of them are fairly low volume, would it make sense to just deploy all of the hearted ones in the next wave?
Sep 18 2017
It sounds like there are two separate questions:
@Pnorman, given that this is related to the request library, do you actually see a way to make the kartotherian service fetch an HTTP(S) resource from an attacker-controlled site?
I strongly support @Tgr's access request as well.
Added the "fetch metrics from graphite / prometheus" option.
Sep 14 2017
Looks like adding the JSON_UNESCAPED_UNICODE flag should do it: http://php.net/manual/en/function.json-encode.php
Sep 13 2017
Given the useful information we have in this task, I am proposing to widen the scope beyond the first job, towards generally coordinating the order of migrating individual jobs. @mobrovac, does that sound reasonable to you?
We briefly discussed this during today's sync meeting. While there are ways to set up targeted processing priorities for specific jobs (by wiki, type, or other criteria), we realized that there will likely be less of a need for this in the new setup. The current Redis job queue divides processing throughput evenly between projects, which makes it relatively likely for individual projects to accumulate large backlogs that then need manual intervention (re-prioritization) to address.
Raised priority, as this a) is blocking the migration to the Kafka job queue backend (T157088), and b) is likely already causing performance and possibly reliability issues in the current job queue.
Sep 12 2017
As far as I can tell, the page image(s) are handled as part of deferred linksUpdate processing. This means that the updates would be executed after the main web request, but on the same PHP thread that handled the original edit request.
Considering the scalability limits of Cassandra's schema synchronization we see in production, I think it would be good to reduce the number of storage groups more aggressively. Perhaps something like this?
Sep 11 2017
Update from our month-end check-in:
@bearND, MediaWiki's section edit feature is implemented without knowledge of a DOM, so <div> wrappers do not suppress edit sections. Example: https://en.wikipedia.org/wiki/User:GWicke/TestSections with source
I believe it was the pageimages designation for those articles I mentioned above. I'm not exactly sure what happened on-wiki, since the revisions have been deleted from public archives (and I don't have permission to view them).
Just to clarify what exactly happened here: the offending edits added an image to the featured page itself, and also nominated that image to be the pageimage?
Yay! 🎆
@Ottomata, from a cursory look at those connectors, it looks like they all aim to capture all SQL updates (update, insert, delete). They don't seem to be targeted at emitting specific semantic events, such as the ones we are interested in for EventBus. This is where the SQL comment idea could help, by letting us essentially embed the events we want to have emitted in the statement, rather than trying to reverse-engineer an event from raw SQL statement(s).
Looking at the three custom changes we did on top of upstream in https://github.com/wikimedia/swagger-ui/commits/master, it seems that the build process we ran after each did not update the source map. However, the gulpfile defines "dist" to be part of the default task (see https://github.com/wikimedia/swagger-ui/blob/master/gulpfile.js#L188). Perhaps we "just" forgot to check in the updated source maps? /cc @Pchelolo
In terms of document structure, the behavior in line two (add section around <div>-wrapped heading) seems to make sense. I think it also matches edit section behavior, which should ignore the <div> completely (as it is not DOM-based).
Sep 8 2017
From a practical perspective, I think the biggest question is how common clients behave these days when must-revalidate is omitted, and the client cache timeout expires. My memory on this is rather foggy, but I *think* in the dark ages behavior in that area was inconsistent, with early IE versions not re-validating even when they were online. If we can verify that all browsers we care about do the right thing (check as if must-revalidate was set when connected), then dropping must-revalidate in the headers would be harmless.
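For reference, the two variants under discussion differ only in the trailing directive; the values below are illustrative, not our actual production headers:

```typescript
import * as http from 'http';

// Sketch contrasting the two Cache-Control variants discussed above.
const WITH_MUST_REVALIDATE = 's-maxage=86400, max-age=300, must-revalidate';
const WITHOUT_MUST_REVALIDATE = 's-maxage=86400, max-age=300';

http.createServer((req, res) => {
  // If all supported browsers re-validate expired entries anyway (when online),
  // the second variant behaves the same while allowing stale use offline.
  res.writeHead(200, {
    'cache-control': WITHOUT_MUST_REVALIDATE,
    'etag': '"example-etag"',
  });
  res.end('hello\n');
}).listen(8080);
```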
We already support fetching specific HTML sections by ID in the REST API (see https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_page_html_title), but until consistent <section> wrapping with a sensible granularity & perhaps a predictable section ID for the lead section are implemented in Parsoid (T114072), this is not as useful in practice as it could be.
This proposed optimization is similar to something I implemented in Parsoid's HTML5 serializer. In that case, we switch between single & double quotes for HTML attributes depending on whether the attribute value contains more single quotes or double quotes. This had a very significant impact on Parsoid HTML size, mainly because it has many JSON values embedded in attributes.
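A minimal sketch of the quote-switching idea (the real Parsoid serializer handles more cases, e.g. escaping rules beyond quotes):

```typescript
// Pick the attribute quote character that needs fewer escapes, i.e. the one
// occurring less often in the value.
function serializeAttribute(name: string, value: string): string {
  const singles = (value.match(/'/g) || []).length;
  const doubles = (value.match(/"/g) || []).length;
  if (doubles > singles) {
    // More double quotes in the value: wrap in single quotes, escape the rest.
    return `${name}='${value.replace(/&/g, '&amp;').replace(/'/g, '&#39;')}'`;
  }
  return `${name}="${value.replace(/&/g, '&amp;').replace(/"/g, '&quot;')}"`;
}

// Example: JSON embedded in data-mw attributes is full of double quotes, so
// single-quoted output avoids one &quot; entity per quote character.
console.log(serializeAttribute('data-mw', '{"parts":[{"template":{}}]}'));
// => data-mw='{"parts":[{"template":{}}]}'
```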
@Pchelolo, based on our previous conversation about this I am assuming that the bulk of the task is a very large list of pages. Is this correct?
Sep 7 2017
Facebook actually heavily relies on SQL comments to pass event information to binlog tailer daemons (see the TAO paper). We currently use those SQL comments only to mark the source of a SQL query (PHP function), but could potentially add some annotations that would make it easy to generically extract & export such events into individual Kafka topics.
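To sketch what such an annotation could look like (the EVENT: marker, field names, and topic are made up for illustration, not an existing convention):

```typescript
// Embed the semantic event to emit as a structured comment in the SQL
// statement itself, so a binlog tailer can extract it and publish it to a
// Kafka topic. Marker, schema, and topic name below are hypothetical.
interface ChangeEvent {
  topic: string;
  meta: { domain: string; dt: string };
  page_title: string;
}

function annotateQuery(sql: string, event: ChangeEvent): string {
  // The tailer would look for the /* EVENT: ... */ prefix and JSON-decode it.
  const annotation = JSON.stringify(event).replace(/\*\//g, '*\\/');
  return `/* EVENT: ${annotation} */ ${sql}`;
}

const sql = annotateQuery(
  'UPDATE page SET page_touched = NOW() WHERE page_id = 1234',
  {
    topic: 'mediawiki.page-touched',
    meta: { domain: 'en.wikipedia.org', dt: new Date().toISOString() },
    page_title: 'Example',
  }
);
console.log(sql); // => /* EVENT: {...} */ UPDATE page SET page_touched = ...
```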
Starting a new section when encountering a new heading of the same level is expected behavior, in line with MediaWiki section edit behavior, as well as HTML5 semantics. When encountering a heading of a higher level (higher number, lower prominence), the sectioning code I wrote in parsoid-utils creates a nested section. This is in line with typical HTML5 section and page outline semantics: https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Using_HTML_sections_and_outlines.
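The nesting rule boils down to something like the following sketch (illustrative only, not the actual parsoid-utils code):

```typescript
// Same-level headings start sibling sections; a numerically higher heading
// (e.g. h3 under h2) opens a nested section, per HTML5 outline semantics.
interface Section { level: number; heading: string; children: Section[] }

function buildOutline(headings: { level: number; text: string }[]): Section[] {
  const root: Section[] = [];
  const stack: Section[] = [];
  for (const h of headings) {
    const section: Section = { level: h.level, heading: h.text, children: [] };
    // Pop back to the nearest ancestor with a lower (more prominent) level.
    while (stack.length > 0 && stack[stack.length - 1].level >= h.level) {
      stack.pop();
    }
    if (stack.length === 0) {
      root.push(section);                              // sibling at top level
    } else {
      stack[stack.length - 1].children.push(section);  // nested section
    }
    stack.push(section);
  }
  return root;
}

// h2, h3, h3, h2 => two top-level sections, the first with two nested ones.
console.log(JSON.stringify(buildOutline([
  { level: 2, text: 'A' }, { level: 3, text: 'A.1' },
  { level: 3, text: 'A.2' }, { level: 2, text: 'B' },
]), null, 2));
```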
Rebased PR now ready at https://github.com/wikimedia/change-propagation/pull/203.
I don't have strong views on how to scale metrics and log collection. In any case, we have been doing this remotely for a while now (using standard formats like gelf for logs), so whether things are aggregated per pod or more centrally doesn't make a big difference to the services themselves.