Text corpus: Difference between revisions

Revision as of 00:47, 25 December 2003

A text corpus is a body of text that follows certain standards, e.g. markup standards such as a wikitext standard or an XML DTD, or even just a common glossary. The entire World Wide Web can be thought of as a single text corpus, divided by language and by documentation license. Tools specific to the analysis of large text corpora are increasingly available for public use.
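As a rough illustration of treating a corpus as one body of text divided by language and license, the following minimal sketch groups a few documents by those two attributes and counts word frequencies per group. The sample documents, the field names `language`, `license` and `text`, and the helper names are all hypothetical and chosen only for illustration; real corpus tools do far more than this.

```python
from collections import Counter, defaultdict
import re

# Hypothetical sample documents; the language and license labels are illustrative only.
DOCUMENTS = [
    {"language": "en", "license": "GFDL", "text": "A text corpus is a body of text."},
    {"language": "en", "license": "public domain", "text": "The corpus of public domain text is large."},
    {"language": "fi", "license": "GFDL", "text": "Tekstikorpus on tekstikokoelma."},
]


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def partition_counts(documents):
    """Group documents by (language, license) and count tokens per group."""
    counts = defaultdict(Counter)
    for doc in documents:
        key = (doc["language"], doc["license"])
        counts[key].update(tokenize(doc["text"]))
    return counts


if __name__ == "__main__":
    # Print the three most common tokens in each (language, license) partition.
    for key, counter in sorted(partition_counts(DOCUMENTS).items()):
        print(key, counter.most_common(3))
```

Keying each partition on a (language, license) pair keeps the divisions described above explicit, so the same counting step can be reused on any subset of the corpus.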

The public domain text corpus is quite large, including all US federal government material, Project Gutenberg, and almost anything published prior to the 20th century. Most of this is now visible and searchable.

There is a lot of material under ambiguous copyright in Netnews posts from the 1980s and 1990s. Most of it is copyrighted by the original author, but some has been explicitly placed in the public domain.

The GFDL text corpus is also quite large and somewhat more robust, although under the influence of several enemy projects it is becoming less robust and more likely to reflect the viewpoints of those projects rather than of those who actually understand the subjects. Contributions to the R&D wiki are part of this corpus, but that does not mean that other contributions to the GFDL corpus, notably those made to Wikipedia or Disinfopedia, should be taken as equally credible to those made via the Consumerium interface. As it stands, Consumerium is the most credible of the various MediaWiki front ends, as it has the least autocratic and least overtly politically biased editorial policy, most likely because its value system promotes transparency and accountability, as opposed to being a "hobby" or "lobbying" project. In time, the standards of the GFDL text corpus will likely be set here, not in the larger or other projects, which will become less trustworthy over time.

The Creative Commons text corpus is also quite large, and some think it may eventually eclipse the GFDL corpus. If so, it is the editorial standards of the CC, not those of the GFDL, that should succeed, as the former are rigorously defined by Lawrence Lessig and others, with an integrity that the operators of GFDL editing sites simply cannot claim.