The University of Amsterdam XML Web Collection is a collection of publicly available XML documents, gathered for research on the XML Web.

Please send any questions by e-mail to sgrijzen [at] science.uva.nl.

This page contains the following sections:

  1. Downloads
  2. Collection Process and General Statistics
  3. XML Metadata Database
  4. Referenced Schemas Collection
  5. Example Queries
  6. SPSS Data Files
  7. Changelog
  8. References

Downloads

The XML Web XML Collection consists of the following downloads:

  1. UvA XML Web XML Collection v1.0 (warning: 4.1 GB)
  2. UvA XML Web XML Metadata Database v1.1 (warning: 155 MB)
  3. UvA XML Web XML Referenced Schemas Collection v1.0 (17 MB)

In addition, collections of XML schema languages have been collected independently:

Used in a study on the Inter-translatability of XML schema languages, the above Referenced Schemas Collection and independent schema collections have been combined into one collection. A description and download are available at:

Additionally, a slightly customized version of libxml2 is available at a separate page:

Furthermore, there are several SPSS data files available:


Collection Process and General Statistics

The files were collected using four steps:

  1. Crawling a list of URLs of XML documents from Yahoo and Google,
  2. Downloading the content of each URL,
  3. Organizing the collection,
  4. Determining duplicates.

An extensive description of the collection process, including the queries that were used, is available in [1]. The general statistics of the collections are as follows:

Filetype | Unique URLs in List | Files in Collection | Loss Percentage | Last File Downloaded
XML      | 188,332             | 180,640             | 4.08%           | 2010-07-17
XSD      | 8,416               | 8,100               | 3.75%           | 2010-07-30
DTD      | 8,765               | 8,229               | 6.12%           | 2010-07-31
RNG      | 8,751               | 8,447               | 3.47%           | 2010-07-30
RNC      | 816                 | 753                 | 7.72%           | 2010-07-30

Unique URLs in List is the number of unique URLs, collected from Google and Yahoo, that point to files with the given extension.
Files in Collection is the number of files that could actually be downloaded.
Loss Percentage is the percentage of URLs that could not be downloaded successfully.
Last File Downloaded is provided to indicate how dated the collection is.
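
The Loss Percentage column can be recomputed from the two count columns. The sketch below assumes the loss is the share of listed URLs that did not yield a file; the function name and the dictionary are illustrative and not part of the collection.

```python
# Counts copied from the statistics table: (unique URLs in list, files in collection).
STATS = {
    "XML": (188_332, 180_640),
    "XSD": (8_416, 8_100),
    "DTD": (8_765, 8_229),
    "RNG": (8_751, 8_447),
    "RNC": (816, 753),
}


def loss_percentage(unique_urls: int, files_in_collection: int) -> float:
    """Share of listed URLs that did not yield a file, as a percentage."""
    return (unique_urls - files_in_collection) / unique_urls * 100


for filetype, (urls, files) in STATS.items():
    print(f"{filetype}: {loss_percentage(urls, files):.2f}%")
```

Under that assumption, the five results agree with the percentages listed in the table.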


XML Metadata Database

The metadata database contains information about the XML files in the collection. It is in SQL format and contains the following relations:

File

The file relation contains the general information about a file in the XML Web XML Collection.

Field Description
id (int, key) This is the unique identifier of a file in the collection. It corresponds with the filename in the collection: <id>.xml
URL (text) The location from which the file was retrieved.
basedomain (text) The base domain extracted from the URL.
domainextension (text) The extension of the domain extracted from the URL. E.g. .com or .co.uk
calculatedfilesize (int) The file size of the document in bytes, calculated on a Unix filesystem.
filenameextension (text) The extension of the file extracted from the URL.
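
As an illustration of how URL-derived fields such as filenameextension can be obtained, here is a minimal sketch using Python's standard library. It is not the procedure used to build the database (see [1] for that); the function name, the example URL and the choice to lowercase the extension are assumptions. Deriving basedomain and domainextension is left out because it needs a list of public suffixes such as .co.uk.

```python
import posixpath
from urllib.parse import urlsplit


def url_fields(url: str) -> dict:
    """Extract the host and the filename extension from a URL (illustrative only)."""
    parts = urlsplit(url)
    # splitext works on the path only, so the query string is ignored.
    extension = posixpath.splitext(parts.path)[1].lstrip(".").lower()
    return {"host": parts.hostname, "filenameextension": extension or None}


print(url_fields("http://www.example.co.uk/feeds/news.XML?page=2"))
```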

Header

The header relation contains the header properties as found when retrieving the documents.

Field Description
id (int, key) This is the unique identifier of a file in the collection. It corresponds with the id in the 'file'-relation and the filename of the file in the collection: <id>.xml
author (text) The 'author' HTTP header. NULL if not found on the file.
cache-control (text) The 'cache-control' HTTP header. NULL if not found on the file.
...
(all other headers that were found follow the structure of the two above)

Duplicate_def5

The duplicate_def5 relation contains a mapping of duplicates in the XML Web XML Collection. Def5 refers to the definition of equality that was used to determine duplicates. In simple terms, duplicates are determined using a hashing algorithm that disregards white space. For more information, please see [1].

If you have a new relation that determines duplicates in the collection using a different definition, please contact us; we would be glad to include it in the collection.

Field Description
fileid (int, foreign) The id of the file which has a duplicate
duplicate (int, foreign) The id of the file which is a duplicate of the file specified in 'fileid'
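
The idea of hashing while disregarding white space can be sketched as follows. This is not the Def5 procedure itself, which is defined in [1]; the choice of SHA-1, the removal of all ASCII white space (including white space inside text content) and the function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def whitespace_insensitive_digest(content: bytes) -> str:
    """Hash the content after dropping all ASCII whitespace bytes."""
    stripped = b"".join(content.split())
    return hashlib.sha1(stripped).hexdigest()


def find_duplicates(files: dict[int, bytes]) -> list[tuple[int, int]]:
    """Return (fileid, duplicate) pairs, using the lowest id in a group as fileid."""
    by_digest: dict[str, list[int]] = defaultdict(list)
    for file_id in sorted(files):
        by_digest[whitespace_insensitive_digest(files[file_id])].append(file_id)
    pairs = []
    for ids in by_digest.values():
        first, *rest = ids
        pairs.extend((first, dup) for dup in rest)
    return pairs


# Hypothetical file contents keyed by file id.
sample = {
    1: b"<a><b>x</b></a>",
    2: b"<a>\n  <b>x</b>\n</a>\n",
    3: b"<a><b>y</b></a>",
}
print(find_duplicates(sample))
```

In this toy input, files 1 and 2 differ only in white space and are reported as one (fileid, duplicate) pair.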

Encoding

The encoding relation records the encodings specified in the XML file and in the HTTP header, and whether the document complies with each specified encoding.

Field Description
id (int, key) This is the unique identifier of a file in the collection. It corresponds with the id in the 'file'-relation and the filename of the file in the collection: <id>.xml
headerencoding (enum) The encoding specified in the charset parameter of the Content-Type HTTP header. NULL if not specified.
headerencodingchecked (boolean) TRUE if the document's content complies with the value specified in headerencoding.
pseudoattrencoding (enum) The encoding specified in the encoding pseudo-attribute of the XML declaration. NULL if not specified.
pseudoattrencodingchecked (boolean) TRUE if the document's content complies with the value specified in pseudoattrencoding.
metatagencoding (enum) The encoding specified in the meta tag of an XHTML document. NULL if not specified.
metatagencodingchecked (boolean) TRUE if the document's content complies with the value specified in the meta tag of an XHTML document.
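
A compliance check of the kind recorded in the "checked" fields can be approximated by trying to decode the document's bytes with the declared encoding. The sketch below is an assumption about what such a check might look like, not the check that was used to fill the database; the function name is hypothetical.

```python
from typing import Optional


def complies_with_encoding(content: bytes, declared: Optional[str]) -> Optional[bool]:
    """Return True if the bytes decode under the declared encoding.

    Returns None when no encoding was declared (NULL in the relation).
    """
    if declared is None:
        return None
    try:
        content.decode(declared)
    except LookupError:
        # The declared name is not an encoding Python knows about.
        return False
    except UnicodeDecodeError:
        return False
    return True


latin1_bytes = "café".encode("iso-8859-1")
print(complies_with_encoding(latin1_bytes, "utf-8"))
print(complies_with_encoding(latin1_bytes, "iso-8859-1"))
```

Note that single-byte encodings such as ISO-8859-1 accept any byte sequence, so a decoding test of this kind can only reject a document, never prove that the declaration is right.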

Xmllint_wellformedness_errors

The xmllint_wellformedness_errors relation contains the errors that were found when parsing the XML files using a slightly customized version of libxml2.

Field Description
id (int, key) Unique identifier of error.
fileid (int, foreign) The id of the file in which the error was found.
linenumber (int) The line number in the file on which the error was found.
errordomain (enum) The domain in which the error occurred. For a list of all possible domains, see [1].
errorlevel (enum) The severity of the error: fatal error, recoverable error, warning or no error.
errortype (enum) The category of error. For a list of all possible categories, see [1].
specificinformation (text) Any extra information on the error, for example the names of the attributes in which the error occurred. NULL when not available.
entityline (text) The line of the entity in which the error occurred. NULL when not available.
element (text) The element in which the error occurred. NULL when not available.
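
To show what a well-formedness error with a line number looks like in practice, the sketch below parses a document with Python's built-in expat-based parser. This is a stand-in: the database was filled using a customized libxml2, whose error domains, levels and types are not reproduced here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def first_wellformedness_error(content: bytes) -> Optional[dict]:
    """Return the first fatal parse error, or None if the document is well-formed."""
    try:
        ET.fromstring(content)
    except ET.ParseError as exc:
        line, column = exc.position
        return {"linenumber": line, "column": column, "message": str(exc)}
    return None


print(first_wellformedness_error(b"<a><b></a>"))
print(first_wellformedness_error(b"<a><b/></a>"))
```

Unlike libxml2, this parser stops at the first fatal error and reports no warnings or recoverable errors, so it can yield at most one row per file.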

Xmllint_validity_errors_dtd

The xmllint_validity_errors_dtd relation contains the errors that were found when validating all well-formed XML files that contain a reference to a DTD schema that could be downloaded and compiled. The DTD schemas are available in the Referenced Schemas Collection. A slightly customized version of libxml2 was used.

Field Description
id (int, key) Unique identifier of error.
schemaid (int, foreign) The id of the DTD schema that was used when validating. It corresponds with the id in the filename of the file in the referenced schema collection: /dtd/<id>.dtd
fileid (int, foreign) The id of the file in which the error was found.
linenumber (int) The line number in the file on which the error was found.
errordomain (enum) The domain in which the error occurred. For a list of all possible domains, see [1].
errorlevel (enum) The severity of the error: fatal error, recoverable error, warning or no error.
errortype (enum) The category of error. For a list of all possible categories, see [1].
specificinformation (text) Any extra information on the error, for example the names of the attributes in which the error occurred. NULL when not available.
entityline (text) The line of the entity in which the error occurred. NULL when not available.
element (text) The element in which the error occurred. NULL when not available.

Xmllint_validity_errors_xsd

The xmllint_validity_errors_xsd relation contains the errors that were found when validating all well-formed XML files that contain a reference to an XSD schema that could be downloaded and compiled. The XSD schemas are available in the Referenced Schemas Collection. A slightly customized version of libxml2 was used.

Field Description
id (int, key) Unique identifier of error.
schemaid (int, foreign) The id of the XSD schema that was used when validating. It corresponds with the id in the filename of the file in the referenced schema collection: /xsd/<id>.xsd
fileid (int, foreign) The id of the file in which the error was found.
linenumber (int) The line number in the file on which the error was found.
errordomain (enum) The domain in which the error occurred. For a list of all possible domains, see [1].
errorlevel (enum) The severity of the error: fatal error, recoverable error, warning or no error.
errortype (enum) The category of error. For a list of all possible categories, see [1].
specificinformation (text) Any extra information on the error, for example the names of the attributes in which the error occurred. NULL when not available.
entityline (text) The line of the entity in which the error occurred. NULL when not available.
element (text) The element in which the error occurred. NULL when not available.

There are two more relations in the database: schema_dtd and schema_xsd. These are described in the section on Referenced Schemas Collection.

Convenience_fileswithfatalerror

This relation is available in version 1.1 and onwards of the UvA XML Web XML Metadata Database.

This relation contains the ids of all XML files that contain a fatal error and are therefore not well-formed. The relation is supplied because the original query is computationally expensive. It can be used as a filter in queries that concern only well-formed files. An example is provided under Example Queries.

Field Description
fileid (int, key) Unique identifier of XML file that contains a fatal error.
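
The original query behind this relation is not given on this page. A plausible reconstruction, run here against a toy in-memory SQLite database rather than the real MySQL dump, selects the distinct file ids that have at least one fatal error. The literal 'fatal error' for errorlevel and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, hypothetical version of the error relation with made-up rows.
CREATE TABLE xmllint_wellformedness_errors (
    id INTEGER PRIMARY KEY,
    fileid INTEGER,
    errorlevel TEXT
);
INSERT INTO xmllint_wellformedness_errors (fileid, errorlevel) VALUES
    (1, 'fatal error'), (1, 'fatal error'), (2, 'warning'), (3, 'fatal error');

-- Materialize the filter once so later queries do not rescan the error relation.
CREATE TABLE convenience_fileswithfatalerror AS
    SELECT DISTINCT fileid
    FROM xmllint_wellformedness_errors
    WHERE errorlevel = 'fatal error';
""")

fatal_ids = [row[0] for row in conn.execute(
    "SELECT fileid FROM convenience_fileswithfatalerror ORDER BY fileid")]
print(fatal_ids)
```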

Convenience_filesthatvalidatewithdtd

This relation is available in version 1.1 and onwards of the UvA XML Web XML Metadata Database.

This relation contains the ids of all XML files that validate with their referenced DTD. The relation is supplied because the original query is computationally expensive.

Field Description
fileid (int, key) Unique identifier of XML file.
schemaid (int, key) Unique identifier of DTD schema file that is referenced by the XML file.

Convenience_filesthatvalidatewithxsd

This relation is available in version 1.1 and onwards of the UvA XML Web XML Metadata Database.

This relation contains the ids of all XML files that validate with their referenced XSD. The relation is supplied because the original query is computationally expensive.

Field Description
id (int, key) Unique identifier of XML file.
schemaid (int, key) Unique identifier of XSD schema file that is referenced by the XML file.

Referenced Schemas Collection

The Referenced Schemas Collection contains the DTD and XSD schemas that are referenced in the documents of the XML Web XML Collection.

Note that schemas that are referenced in the XML Web XML Collection but could not be downloaded are not included.

DTDs

All DTDs can be found in /dtd/<id>.dtd.

All includes in the retrieved DTDs are also retrieved recursively. They are available in a folder with the name of the DTD that includes them: /dtd/<id>/include1.dtd. All includes in the DTD files are modified to point to the locally downloaded versions.

The relation 'schema_dtd' in the Metadata Database contains the mapping between the locally downloaded DTDs in the Referenced Schemas Collection and the XML files in the XML Web XML Collection that reference them.

Schema_dtd

The schema_dtd relation contains the references to DTDs that are found in the XML Web XML Collection. It maps each reference to the locally downloaded DTD in the Referenced Schemas Collection.

Field Description
id (int, key) ID of the tuple.
schemaid (int, foreign key) The id of the DTD schema. It corresponds with the id in the filename of the file in the referenced schema collection: /dtd/<id>.dtd
fileid (int, foreign key) This is the unique identifier of a file in the XML Web XML collection, that contains a reference to the DTD specified in 'schemaid'
reconstructedurl (text) The URL from which the DTD was retrieved. It might be reconstructed from a relative url.
compiles (enum(true,false)) TRUE if the schema can be compiled (contains no errors)

XML Schemas (XSD)

All XSDs can be found in /xsd/<id>.xsd.

All other XSDs and other files that are referenced from inside the retrieved XSDs are also retrieved recursively. They are available in a folder with the name of the XSD: /xsd/<id>/include1.xsd.

The relation 'schema_xsd' in the Metadata Database contains the mapping between the locally downloaded XSDs in the Referenced Schemas Collection and the XML files in the XML Web XML Collection that reference them. All includes in the XSD files are altered to point to the locally downloaded versions.

Schema_xsd

The schema_xsd relation contains the references to XSDs that are found in the XML Web XML Collection. It maps each reference to the locally downloaded XSD in the Referenced Schemas Collection.

Field Description
id (int, key) ID of the tuple.
schemaid (int, foreign key) The id of the XSD schema. It corresponds with the id in the filename of the file in the referenced schema collection: /xsd/<id>.xsd
fileid (int, foreign key) This is the unique identifier of a file in the XML Web XML collection, that contains a reference to the XSD specified in 'schemaid'
reconstructedurl (text) The URL from which the XSD was retrieved. It might be reconstructed from a relative url.
compiles (enum(true,false)) TRUE if the schema can be compiled (contains no errors)

Example Queries

Question:

From how many different web sites are the files in the collection downloaded?

Query:

SELECT count(distinct basedomain)
FROM `file` 
WHERE id NOT IN (SELECT distinct duplicate 
                 FROM duplicate_def5);

Answer:

96,650 different web sites

Question:

What percentage of documents in the collection specify an encoding in the Content-Type HTTP header?

Query:

SELECT count(*) as count, 
      (count(*) / (SELECT count(*) FROM file) *100) as percentage 
FROM encoding 
WHERE headerencoding IS NOT NULL;

Answer:

+-------+------------+
| count | percentage |
+-------+------------+
| 42669 |    23.6210 | 
+-------+------------+
1 row in set (0.01 sec)
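
The percentage in this answer can be checked by hand. The sketch assumes that the file relation holds 180,640 rows, the number of XML files listed under Collection Process and General Statistics; that figure is inferred and is not part of the query output.

```python
files_with_header_encoding = 42_669   # count column of the answer above
files_in_collection = 180_640         # assumed row count of the file relation

percentage = files_with_header_encoding / files_in_collection * 100
print(f"{percentage:.4f}")
```

With that row count the result matches the percentage column of the answer.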

Question:

What percentage of DTDs in the collection can be compiled?

Query:

SELECT compilestrue,
       ( compilestrue / ( compilestrue + compilesfalse) ) as percentagetrue,
       compilesfalse,
       ( compilesfalse / ( compilestrue + compilesfalse ) ) as percentagefalse 
FROM (SELECT count(distinct schemaid) as compilestrue 
      FROM schema_dtd 
      WHERE compiles='true') as t1,
     (SELECT count(distinct schemaid) as compilesfalse 
      FROM schema_dtd 
      WHERE compiles='false') as t2;

Answer:

+--------------+----------------+---------------+-----------------+
| compilestrue | percentagetrue | compilesfalse | percentagefalse |
+--------------+----------------+---------------+-----------------+
|          909 |         0.6611 |           466 |          0.3389 | 
+--------------+----------------+---------------+-----------------+
1 row in set (0.05 sec)

66.1% of DTDs can be compiled.
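
Note that the columns named percentagetrue and percentagefalse hold fractions between 0 and 1 rather than percentages. The arithmetic behind the answer is shown below as a small worked check; the variable names are illustrative.

```python
compiles_true = 909    # distinct DTDs that compile, from the answer above
compiles_false = 466   # distinct DTDs that do not compile

total = compiles_true + compiles_false
fraction_true = compiles_true / total
fraction_false = compiles_false / total

print(f"{fraction_true:.4f} {fraction_false:.4f}")
print(f"{fraction_true * 100:.1f}% of DTDs can be compiled")
```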

Question:

What is the filesize of the largest file that is well-formed?

Query:

SELECT max(calculatedfilesize) 
FROM file 
WHERE id NOT IN (SELECT fileid 
                 FROM convenience_fileswithfatalerror);

Answer:

716853016 bytes


SPSS Data Files

SPSS was used for correlation and regression analysis. The values were extracted from the XML Web XML Metadata Database.

This file has data of all files that contain a fatal error during well-formedness parsing. It has (1) the id of the file, (2) the number of fatal errors that were found in the file, (3) the calculated file size of the file.

This file has data of all files in the collection. It has (1) the id of the file, (2) the calculated file size of the file, (3) whether or not it contains a fatal error (0=false, 1=true).

This file has data of all files that reference a DTD. It has (1) the id of the file, (2) whether or not the DTD that is referenced can be compiled (0=false, 1=true), (3) whether or not the file contains a fatal error (0=false, 1=true).

This file has data of all files in the collection. It has (1) the id of the file, (2) the domain name extension (UNKNOWN if not available), (3) the filename extension (UNKNOWN if not available), (4) whether or not the file validates with the schema it references (0=false, 1=true), (5+) the value of all HTTP 1.1 headers (UNKNOWN if not available).


Changelog


References

  1. Steven Grijzenhout (2010). Quality of the XML Web. Master's thesis. University of Amsterdam, Amsterdam, The Netherlands. (1.7 MB)