
Categorization requirements

From Meta, a Wikimedia project coordination wiki

These are some abstract requirements for any proposal to introduce categorization into MediaWiki. The idea here is to have some kind of metric to measure implementation proposals against.

Goals (or, "Why Implement Categories?")

Currently the wiki allows readers to find information in a couple of ways (more?):

  • follow links in the articles, starting at the main page
  • search for a term, resulting in either an exact match, or a list of partial matches

The wiki has its own search utility, but web users can also search on Google using the "site:wikipedia.org" search criterion.

These methods of finding information will fail in the following cases:

  • searching with a term or terms not found in the document
  • searching with too general a term or terms, resulting in more hits than can be checked in a reasonable time
  • user does not know the information exists

The last case is strange, but is valid if you consider that many users might be interested in browsing for knowledge related to some subject with which they are familiar, but which no link-following strategy has allowed them to discover. In other words, the chain of links from the article currently being viewed to the potentially useful article is too long, too convoluted, or simply lost in the complexity of the network.

Means other than categories for providing shortcuts to articles already exist. One is contributor-created lists. These have to be kept up to date by manual editing, which is usually quite reliable but involves a time delay.

A possible solution to the problem of keeping useful lists of articles up to date is to try to find a way to partly automate the process. Some automatic processes already exist. Overly simple list generation methods tend to be comprehensive, but suffer from the same problem as lists of search results: more entries than can be checked in a reasonable time. Moreover, on Wikipedia, the alphabetical listing seems to have been turned off. The page loads, but this message appears at the top: "Sorry! This feature has been temporarily disabled because it slows the database down to the point that no one can use the wiki." This demonstrates that automatic organization methods are hard to design not only usefully, but also efficiently.

Another interesting way to see a comprehensive list of articles connected with a given article is to generate localized site maps, or link graphs. These grow exponentially, however, so you can only follow two or maybe three links before they become unreadable. An article itself is already a one-degree-of-separation link graph. As mentioned, there is no way to be certain that all articles which a user would expect to be related would show up within three links. Network graphs are also terribly complicated to diagram. They are effectively impossible to generate on a public-access server.

An answer to the weaknesses of inter-article links is to define some kind of logical entity which is not an article, but a hub of connection for articles: a category. Categories are a means of partitioning the articles into various sets based on some kind of relationship, usually taken to be subject matter. Defining subjects, and hence categories, is a challenge. The task can be made both easier and more difficult by putting various types of constraints on the nature of and means of defining categories.

Since many categorization schemes already exist in library and information science, it is common to suggest use of one of these. They have two weaknesses:

  1. they are designed for books stored in stacks in a library, not for articles
  2. they change all the time

Also, the size of most such category schemes is prohibitive, they would require some level of expertise both to implement and to use, and many categories in such schemes would be empty of articles in the wiki. Finally, adding an article to a predetermined set of categories requires the tedious task of searching for the correct category that seems to fit.

The alternative is to allow wiki contributors to define their own categories. This is fairly simple; they just need a means to add a name or other identifier to an article that represents the category (or categories) in which the article should be placed. Specifying parent-child category relationships is more complicated, but could possibly be done using a special interface. Also, there is a question of what happens if two categories have the same unqualified name (where the qualified name lists the chain of categories descending from some root category to the category itself).
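To make the naming question concrete, here is a minimal sketch of how a qualified name could tell apart two categories that share an unqualified name. The category IDs, the names, the "/" separator, and the assumption that each category has exactly one parent are all hypothetical; nothing here is an existing MediaWiki feature.

```python
# Hypothetical parent map: category ID -> parent ID (None marks a root).
PARENT = {
    "root": None,
    "computing": "root",
    "geography": "root",
    "java_lang": "computing",
    "java_island": "geography",
}

# Display names; two distinct categories share the unqualified name "Java".
NAME = {
    "root": "All",
    "computing": "Computing",
    "geography": "Geography",
    "java_lang": "Java",
    "java_island": "Java",
}


def qualified_name(cat_id, sep="/"):
    """Join the names on the path from the root down to cat_id."""
    parts = []
    seen = set()
    while cat_id is not None:
        if cat_id in seen:
            # Guard against a malformed parent map that loops back on itself.
            raise ValueError("cycle in parent map")
        seen.add(cat_id)
        parts.append(NAME[cat_id])
        cat_id = PARENT[cat_id]
    return sep.join(reversed(parts))
```

Under these assumptions the two "Java" categories get the distinct qualified names "All/Computing/Java" and "All/Geography/Java". If a category may have several parents, a single qualified name is no longer well defined, which is part of why the question is open.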

If a category scheme is implemented, the software would have to be extended to provide facilities to take advantage of it: to allow users to look at lists of articles which fit in categories, and lists of sub-categories of categories. If these lists are too long or too complicated to read, or if their generation places too great a strain on resources, they will not be usable.

High Priority Requirements

  • Articles are associated with a category somehow.
  • The association between article and category is edited and stored with the article, not with the category. This is usually called a "bottom-up" approach.
  • An article can belong to multiple categories.
  • Categories can have one or more sub-categories.
  • Every category except the top ("all", "main", "universal") category must have a parent category. Disagree. A "forest" of category trees is useful, even if there's not a single root. This shouldn't be a requirement. --Evan
It’s counterproductive because it's harder to use. In any case, you can just make every tree rooted in another tree, and satisfy both requirements easily. Magic! Brent Gulanowski
  • Category pages have links to articles in that category.
  • Category pages have links to sub-categories of that category.
  • Articles have a special link to the category or categories that they belong to.
  • Pages for categories can have some additional text besides the list of articles and sub-categories. For example, an introduction to the subject of the category.
  • Need to be able to categorize the 200K existing articles in English Wikipedia.
  • Categorization happens within the MediaWiki installation; it's not an externally-maintained structure, with pointers to articles.
  • Changing a category's name shouldn't lose the link between articles and categories.
  • A category is defined by the articles in the category (as opposed to the name, which is somewhat arbitrary). A category has zero or more supercategories, zero or more subcategories, zero or more articles directly contained, and some explanatory text.
  • A category ID is associated with the article which summarizes the category. I don't understand what this means. Oh, like they're different database objects? Again, none of your beeswax. --Evan
  • A category ID is assigned to the articles in the category. What does that mean? --Evan
Well, I'm just trying to work towards a consistent terminology. I am groping towards a less vague way to describe the article-to-article versus the article-to-category relationships. I'm stuck with using words, but the words are slippery. Please step in and do a better job and I'll be grateful. Brent Gulanowski
  • An article is assigned to a category by associating it with another, already categorized article; if the latter article has more than one category, only one should be chosen; additional categories can be assigned to the first article by associating it with additional pages. (Articles in the same category are categorically associated with one another.) So, this is categorization by prototype? Instead of saying, "This 'sonnet' article is in the category [[poetry]]," I have to say, "This article is like [[villanelle]]"? That's bogus. --Evan
I'm tempted to say "your statement is bogus". Can you be more specific, and, if it's not too much trouble, leave your feelings out of it? If you think it's ponderous, OK, that's fair. Brent Gulanowski
  • New categories are created by selecting one or more articles in an existing category and defining a sub-category, defining a partition of the current category. This isn't terribly abstract. I don't see why creating categories can't work exactly like creating articles -- make a link to it, and if it doesn't work, follow the link and start editing. --Evan
"Terribly abstract"? Not sure what you're trying to say with that. Categories that are too diluted have little to no value because they do not increase the meaningful complexity of the wiki. If a category is just a keyword, then forget categories and just implement keywords already. Categories are categorically different than keywords. They are inherently more restrictive. That's the point! Brent Gulanowski
  • Articles in a new sub-category will no longer appear in lists of articles in the parent category.
  • Creation of a new category must include the specification of the parent category. Disagree. See above -- a forest of category trees is useful, and we can "splice" them as we go. --Evan
You actually did not explain "above" why you thought it was useful, only the fact that you did. I say it's not useful at all because you end up having to use the search engine anyway. Categories should allow you to browse related articles without using the search engine. Otherwise what is the point? Brent Gulanowski
  • It should be possible to delete a redundant or withered category.
  • There should be a maximum number of articles per category. No. See Zero-One-Infinity Rule. Bogus top limits are bogus. We'll know that a category has reached its top limit when the page doesn't load. --Evan
  • Before new articles can be added to a "full" category, some of its articles must be moved to a (probably new) sub-category. Again, artificial maxes are bogus. --Evan
The word "bogus" is starting to sound bogus. Artificial maximums actually serve a purpose of encouraging editors to think about the structure of the information. I'd be ecstatic to find a better means to do the same thing. Maybe it's "un-wiki" to restrict people at all, but I really think that categories are not the same as content and are only useful if some constraints are introduced. If you can stop being so dogmatic about it, maybe we can actually find a middle ground. Brent Gulanowski
  • It should be possible to generate a list of the articles in a particular category.
  • It should be possible to generate a list of sub-categories of a particular category.
  • It should be possible to generate a list of categories to which a given article belongs.
  • Circular loops in the category graph are forbidden -- if sub-categories are defined as above (or in a similarly restricted manner) this should be guaranteed.

Too hard to enforce. --Evan

Try this: write a function which traces the category hierarchy as soon as an article is given a new category. It's pretty simple. You can use recursion or iteration. Brent Gulanowski
  • Membership in a category implies that, for all articles x in category S, P(x) holds for some predicate P which is characteristic of the category (P is implicit but undefined). This is a requirement for using categorization, not for the categorization feature of MediaWiki. It could be better said as, "Don't make stupid categories", or "Don't put things in categories they don't belong in." It's an editors' problem, not a programmers' problem. --Evan
See above on the importance of making the obvious explicit. Hint: not everybody agrees on what is "obvious". Brent Gulanowski
  • It should never be necessary to define a category before writing an article (consequence of this is that the "universal" category cannot have a maximum membership, since all articles that are not explicitly given different categories end up there). Also, since we have 200k articles and zero categories, it'd make categorization useless. We'd have to include a time machine in the implementation. --Evan
Well, the alternative, which some people (hey, it takes all kinds) might even consider sensible, is putting all the existing articles into categories manually as a process of the implementation of a system which violated this "obvious" rule. No time machine required; you just do a lot of offline editing. Or you use some UNIX tool, whatever turns your crank. Brent Gulanowski
  • Categorization should work for pages in all namespaces: Wikipedia:, Image:, Talk:, etc.
  • Categorization shouldn't be mandatory.
Why not? Because it would make people have to think? Well, what an imposition for an encyclopedia contributor, to actually have to think outside of their role as author. Oh, wait, I explained how to have default, non-intrusive categorization with absolutely no imposition on the contributors, but that violates some principle of the wiki, I guess? Brent Gulanowski
  • Categorization shouldn't interfere with namespacing.
  • Categorization shouldn't interfere with subpages.
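Several of the requirements above (bottom-up association, multiple categories per article, sub-categories, list generation, and the ban on circular loops) can be illustrated together. The following is only a sketch under stated assumptions: the class `CategoryGraph` and its method names are invented for illustration, everything is held in memory, and a category may have several parents. It uses the iterative form of the hierarchy trace suggested in the discussion, run when a sub-category link is added.

```python
from collections import defaultdict


class CategoryGraph:
    """Toy in-memory model; a real wiki would keep this in database tables."""

    def __init__(self):
        self.parents = defaultdict(set)        # category -> its supercategories
        self.children = defaultdict(set)       # category -> its sub-categories
        self.members = defaultdict(set)        # category -> articles directly in it
        self.categories_of = defaultdict(set)  # article -> its categories

    def _is_ancestor(self, candidate, category):
        """Walk upward from category, iteratively, looking for candidate."""
        stack, seen = [category], set()
        while stack:
            current = stack.pop()
            if current == candidate:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(self.parents[current])
        return False

    def add_subcategory(self, parent, child):
        """Link child under parent, refusing any edge that would close a loop."""
        if self._is_ancestor(child, parent):
            raise ValueError(
                f"linking {child!r} under {parent!r} would create a loop"
            )
        self.parents[child].add(parent)
        self.children[parent].add(child)

    def categorize(self, article, category):
        """Bottom-up association: recorded from the article's side."""
        self.categories_of[article].add(category)
        self.members[category].add(article)

    def articles_in(self, category):
        """List the articles directly contained in a category."""
        return sorted(self.members[category])

    def subcategories_of(self, category):
        """List the direct sub-categories of a category."""
        return sorted(self.children[category])
```

The check costs one upward walk per new sub-category link, so its cost depends on the depth of the hierarchy above the parent rather than on the number of articles. Whether that is cheap enough on a live database is exactly the point under dispute above, and this sketch does not settle it.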

Nice Things to Have

  • each category has its own maximum number of articles
  • If a "cross-over" (or "hybrid") category is the name of the logical category defined by all of the categories of a certain article, it should be possible (ideally) to see the list of hybrids to which a real category (one with a unique ID) belongs (this is perhaps only a variation or extension of the list of articles which are associated with the category).
  • It would be desirable to be able to generate a section of the category graph, say all parent, child, and hybrid categories one, two, or possibly even three edges away.
  • If manually created ordered lists of articles are considered a type of (ad hoc) categorization scheme, a proper categorization scheme should be able to replace these ad hoc lists, possibly by the use of default categorization; existing software for producing, say alphabetical lists of articles should be replaced by standardized procedures in the categorization software.
  • The option to separate category description text (also called "summary article") from membership list into different pages.
  • For articles with more than one category, one of them should be designated the permanent category.
  • Ideally, a page's permanent category should not change, i.e.: the system ID should not change, though the name of the category may change.
  • The list of articles in a category will include (in a distinctive manner) the names of its sub-categories, which stand in for the articles in the sub-categories (membership in a category implies membership in the parent).
  • As a way to avoid orphaned articles, an automatic assignment of a default category based on some simple scheme, such as alphabetical (by title), date added, or the like
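The item above about generating a section of the category graph can be sketched as a breadth-first search with a cut-off radius. This is a minimal illustration, assuming the graph is available as a list of (parent, child) pairs; the function name `neighbourhood` and its parameters are hypothetical, and hybrid categories are left out.

```python
from collections import deque


def neighbourhood(edges, start, max_edges=2):
    """Return {category: distance} for categories within max_edges of start.

    edges is an iterable of (parent, child) pairs. Direction is ignored,
    so both parent and child categories are reached.
    """
    adjacent = {}
    for parent, child in edges:
        adjacent.setdefault(parent, set()).add(child)
        adjacent.setdefault(child, set()).add(parent)

    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_edges:
            continue  # do not expand past the requested radius
        for neighbour in adjacent.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance
```

Because the search stops at the requested radius, the work is bounded by the size of the local section rather than the whole graph, which matters given the concern about link graphs growing too quickly to read.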

Conceptual Problems Which Ought to be Addressed by Any Implementation

  • somehow balance simplicity with usefulness (age-old design issue)
  • allow editors to define categories without excessive hoop-jumping
  • give editors the freedom to define meaningful categories while ensuring they are relevant in the context of existing categories
  • if editors can create "random" or otherwise rootless categories, these should be handled in an elegant (or anyways sensible) manner
  • if categories are assigned IDs or are otherwise coded, avoid requiring the editors to deal with IDs or codes directly
  • coded and plaintext classifications at the same time need to be handled without impeding one or the other
  • the difficulty of providing a meaningful definition of what makes a category useful, and the near impossibility of introducing fair and honest constraints to help ensure such usefulness of added categories
  • the challenge of defining a categorization scheme which is fundamentally different, and more powerful, than some mere keyword meta-tag system, which is the best anyone's offered so far that didn't offend the fragile sensibilities of the wiki religious establishment -- just kidding! Nobody is reading this page anyway! But I wish they would!
  • Creating a category system that people find interesting enough to support

Theoretical Basis

It would be desirable, if not absolutely necessary, to consider some kind of scientific or mathematical model for a categorization scheme. As categories of articles correspond closely to sets of elements, it seems natural to consider and adhere to aspects of set theory when developing a scheme. Because relations between sets can be described using graphs, where vertices are the sets and directed edges are the subset relationship, graph theory should also be useful.

In addition, it is also desirable that any categorization scheme includes an algorithm for converting a group of uncategorized articles (which could be called a set of uncategorized articles) into a categorized group of articles (articles which belong to the categorized set, which is to say that they are members of sets which are in a set called "category"). Such an algorithm can thus be judged on its suitability in light of various qualities: programmability (how well, and to what extent, it can be automated), synchronization, parallelism, recursiveness, and reliability. It is also important whether the algorithm can be performed reliably and naturally by human beings, especially non-specialists, and, moreover, in numbers but working mostly independently.

Algorithms also have the benefit that they can define, implicitly or explicitly, constraints on the way categories are created, how they are named, and the like.
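As one very simple instance of such an algorithm, the sketch below turns a set of uncategorized articles into a categorized one by giving each article a default category based on the first letter of its title, as suggested under "Nice Things to Have". The function name, the "Default/" prefix, and the "#" bucket for titles that do not start with a letter are assumptions made for illustration only.

```python
def assign_default_categories(titles, existing=None):
    """Give every uncategorized title a default, alphabetical category.

    existing maps title -> set of categories already chosen by editors;
    those titles keep their categories unchanged.
    """
    existing = existing or {}
    result = {}
    for title in titles:
        if existing.get(title):
            result[title] = set(existing[title])
            continue
        first = title.strip()[:1].upper()
        # Titles that do not start with a letter share one catch-all bucket.
        bucket = first if first.isalpha() else "#"
        result[title] = {f"Default/{bucket}"}
    return result
```

Each title is handled independently of the others, so this particular scheme is easy to automate and to run in parallel. It says nothing about subject matter, which is where human editors would still be needed.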

Ease or Possibility of Implementation

An absolute necessity of any scheme is that it can be practically implemented using the technology at hand, that is, without breaking Wikipedia.
