University of Amsterdam XML Web Collection, July 2010
The University of Amsterdam XML Web Collection is a collection of publicly available XML documents. It was collected for research purposes on the XML Web.
Please send any questions by e-mail to sgrijzen [at] science.uva.nl.
The XML Web XML Collection consists of the following downloads:
In addition, collections of XML schema languages have been collected independently:
For a study on the inter-translatability of XML schema languages, the Referenced Schemas Collection above and the independent schema collections have been combined into one collection. A description and download are available at:
Additionally, a slightly customized version of libxml2 is available at a separate page:
Furthermore, there are several SPSS data files available:
The files were collected using four steps:
An extensive description of the collection process, including the queries that were used, is available in [1]. The general statistics of the collections are as follows:
| Filetype | Unique URLs in List | Files in Collection | Loss Percentage | Last File Downloaded |
| XML | 188,332 | 180,640 | 4.08% | 17-07-2010 |
| XSD | 8,416 | 8,100 | 3.75% | 30-07-2010 |
| DTD | 8,765 | 8,229 | 6.12% | 31-07-2010 |
| RNG | 8,751 | 8,447 | 3.47% | 30-07-2010 |
| RNC | 816 | 753 | 7.72% | 30-07-2010 |
Unique URLs in List is the number of unique URLs, collected from Google and Yahoo, that point to files with the given extension.
Files in Collection is the actual number of files that could be downloaded.
Loss Percentage is the percentage of URLs that could not be downloaded successfully.
Last File Downloaded is provided to indicate how recent the collection is.
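The Loss Percentage column follows directly from the two count columns. As a minimal sketch in Python (the function name `loss_percentage` is illustrative and not part of the collection's tooling), the figures in the table can be recomputed like this:

```python
def loss_percentage(unique_urls: int, files_downloaded: int) -> float:
    """Percentage of listed URLs that could not be downloaded."""
    return (unique_urls - files_downloaded) / unique_urls * 100


# (Unique URLs in List, Files in Collection), copied from the table above.
counts = {
    "XML": (188_332, 180_640),
    "XSD": (8_416, 8_100),
    "DTD": (8_765, 8_229),
    "RNG": (8_751, 8_447),
    "RNC": (816, 753),
}

for filetype, (urls, files) in counts.items():
    print(f"{filetype}: {loss_percentage(urls, files):.2f}%")
```

Rounded to two decimals, the results match the Loss Percentage column.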
The metadata database contains information about the XML files in the collection. It is in SQL format and contains the following relations:
The file relation contains the general information about a file in the XML Web XML Collection.
| Field | Description |
| id (int, key) | This is the unique identifier of a file in the collection. It corresponds with the filename in the collection: <id>.xml |
| URL (text) | The location from which the file was retrieved. |
| basedomain (text) | The base domain extracted from the URL. |
| domainextension (text) | The extension of the domain extracted from the URL. E.g. .com or .co.uk |
| calculatedfilesize (int) | The file size of the document in bytes, as calculated on a Unix filesystem. |
| filenameextension (text) | The extension of the file extracted from the URL. |
The header relation contains the header properties as found when retrieving the documents.
| Field | Description |
| id (int, key) | This is the unique identifier of a file in the collection. It corresponds with the id in the 'file'-relation and the filename of the file in the collection: <id>.xml |
| author (text) | The 'author' HTTP header. NULL if not found on the file. |
| cache-control (text) | The 'cache-control' HTTP header. NULL if not found on the file. |
| ... | All other headers that were found follow the structure of the two above. |
The duplicate_def5 relation contains a mapping of duplicates in the XML Web XML Collection. Def5 refers to the definition of equality that was used to determine duplicates. In simple terms, equality is determined using a hashing algorithm that disregards white space. For more information please see [1].
If you have a new relation that determines duplicates in the collection using a different definition, please contact us; we would be glad to include it in the collection.
| Field | Description |
| fileid (int, foreign) | The id of the file which has a duplicate |
| duplicate (int, foreign) | The id of the file which is a duplicate of the file specified in 'fileid' |
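The idea of detecting duplicates with a hash that disregards white space can be sketched as follows. This is only an illustration under stated assumptions: the exact Def5 definition is given in [1], and the choice of SHA-1, the removal of all ASCII white space (including white space inside text content), and the rule that the lowest id counts as the original are assumptions made here. The function names are hypothetical.

```python
import hashlib


def whitespace_insensitive_digest(data: bytes) -> str:
    """Hash the content after removing every run of ASCII white space."""
    # bytes.split() without arguments splits on runs of ASCII white space.
    stripped = b"".join(data.split())
    return hashlib.sha1(stripped).hexdigest()  # SHA-1 is an assumption


def find_duplicates(files: dict[int, bytes]) -> list[tuple[int, int]]:
    """Return (fileid, duplicate) pairs, shaped like rows of duplicate_def5.

    Assumption: the lowest id with a given digest is treated as the original.
    """
    first_seen: dict[str, int] = {}
    pairs: list[tuple[int, int]] = []
    for file_id in sorted(files):
        digest = whitespace_insensitive_digest(files[file_id])
        if digest in first_seen:
            pairs.append((first_seen[digest], file_id))
        else:
            first_seen[digest] = file_id
    return pairs


# Made-up sample content, keyed by file id.
sample = {
    1: b"<a> <b/> </a>",
    2: b"<a><b/></a>\n",
    3: b"<a><c/></a>",
}
print(find_duplicates(sample))
```

With the made-up sample, files 1 and 2 differ only in white space and are reported as one pair, while file 3 is not a duplicate of either.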
The encoding relation contains the encodings specified in the XML file and in the HTTP header, and records whether the document complies with each specified encoding.
| Field | Description |
| id (int, key) | This is the unique identifier of a file in the collection. It corresponds with the id in the 'file'-relation and the filename of the file in the collection: <id>.xml |
| headerencoding (enum) | The encoding specified in the charset parameter of the Content-Type HTTP header. NULL if not specified. |
| headerencodingchecked (boolean) | TRUE if the document's content complies with the encoding specified in the charset parameter of the Content-Type HTTP header. |
| pseudoattrencoding (enum) | The encoding specified in the encoding pseudo-attribute of the XML declaration. NULL if not specified. |
| pseudoattrencodingchecked (boolean) | TRUE if the document's content complies with the encoding specified in the XML declaration. |
| metatagencoding (enum) | The encoding specified in the meta tag of an XHTML document. NULL if not specified. |
| metatagencodingchecked (boolean) | TRUE if the document's content complies with the encoding specified in the meta tag of an XHTML document. |
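The "checked" fields record whether a document's bytes are consistent with a declared encoding. How the collection performed this check is not described on this page; one simple way to approximate it, shown here purely as an assumption, is to try decoding the raw bytes with the declared encoding. The function name `complies_with_encoding` is hypothetical.

```python
def complies_with_encoding(content: bytes, encoding: str) -> bool:
    """Return True if the raw bytes decode without error in the given encoding."""
    try:
        content.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: the bytes are invalid for this encoding.
        # LookupError: Python does not know the encoding name.
        return False
    return True


utf8_bytes = "é".encode("utf-8")
latin1_bytes = "é".encode("latin-1")

print(complies_with_encoding(utf8_bytes, "utf-8"))
print(complies_with_encoding(latin1_bytes, "utf-8"))
```

Note that this test is weak for single-byte encodings such as ISO-8859-1, which accept any byte sequence, so a successful decode does not prove that the declared encoding is the intended one.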
The xmllint_errors relation contains the errors that were found when parsing the XML files using a slightly customized version of libxml2.
| Field | Description |
| id (int, key) | Unique identifier of error. |
| fileid (int, foreign) | The id of the file in which the error was found. |
| linenumber (int) | The line number in the file on which the error was found. |
| errordomain (enum) | The domain in which the error occurred. For a list of all possible domains see [1]. |
| errorlevel (enum) | The severity of the error: fatal error, recoverable error, warning or no error. |
| errortype (enum) | The category of error. For a list of all possible categories see [1]. |
| specificinformation (text) | Any extra information on the error, e.g. attribute names in which the error occurred. NULL when not available. |
| entityline (text) | The line of the entity in which the error occurred. NULL when not available. |
| element (text) | The element in which the error occurred. NULL when not available. |
The xmllint_validity_errors_dtd relation contains the errors that were found when validating all well-formed XML files that contain a reference to a DTD schema that could be downloaded and compiled. The DTD schemas are available in the XML Web Referenced Schemas Collection. A slightly customized version of libxml2 was used.
| Field | Description |
| id (int, key) | Unique identifier of error. |
| schemaid (int, foreign) | The id of the DTD schema that was used when validating. It corresponds with the id in the filename of the file in the referenced schema collection: /dtd/<id>.dtd |
| fileid (int, foreign) | The id of the file in which the error was found. |
| linenumber (int) | The line number in the file on which the error was found. |
| errordomain (enum) | The domain in which the error occurred. For a list of all possible domains see [1]. |
| errorlevel (enum) | The severity of the error: fatal error, recoverable error, warning or no error. |
| errortype (enum) | The category of error. For a list of all possible categories see [1]. |
| specificinformation (text) | Any extra information on the error, e.g. attribute names in which the error occurred. NULL when not available. |
| entityline (text) | The line of the entity in which the error occurred. NULL when not available. |
| element (text) | The element in which the error occurred. NULL when not available. |
The xmllint_validity_errors_xsd relation contains the errors that were found when validating all well-formed XML files that contain a reference to an XSD schema that could be downloaded and compiled. The XSD schemas are available in the XML Web Referenced Schemas Collection. A slightly customized version of libxml2 was used.
| Field | Description |
| id (int, key) | Unique identifier of error. |
| schemaid (int, foreign) | The id of the XSD schema that was used when validating. It corresponds with the id in the filename of the file in the referenced schema collection: /xsd/<id>.xsd |
| fileid (int, foreign) | The id of the file in which the error was found. |
| linenumber (int) | The line number in the file on which the error was found. |
| errordomain (enum) | The domain in which the error occurred. For a list of all possible domains see [1]. |
| errorlevel (enum) | The severity of the error: fatal error, recoverable error, warning or no error. |
| errortype (enum) | The category of error. For a list of all possible categories see [1]. |
| specificinformation (text) | Any extra information on the error, e.g. attribute names in which the error occurred. NULL when not available. |
| entityline (text) | The line of the entity in which the error occurred. NULL when not available. |
| element (text) | The element in which the error occurred. NULL when not available. |
There are two more relations in the database: schema_dtd and schema_xsd. These are described in the section on Referenced Schemas Collection.
This relation is available in version 1.1 and onwards of the UvA XML Web XML Metadata Database.
This relation (named convenience_fileswithfatalerror in the example queries) contains the ids of all XML files that contain a fatal error and are therefore not well-formed. It is supplied because the original query is computationally expensive. It can be used as a filter in queries that should consider only well-formed files. An example is provided under the example queries.
| Field | Description |
| fileid (int, key) | Unique identifier of XML file that contains a fatal error. |
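How a convenience relation of this kind could be derived and then used as a filter can be sketched as follows. The sketch runs against an in-memory SQLite database through Python's `sqlite3` module rather than the MySQL-style database the collection ships, uses made-up rows, and assumes that the errorlevel value for fatal errors is stored as the literal string 'fatal error'. The relation name convenience_fileswithfatalerror is the one used in the example queries; the derivation query itself is a guess, not the authors' original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced versions of the relations, with made-up rows.
    CREATE TABLE file (id INTEGER PRIMARY KEY, calculatedfilesize INTEGER);
    CREATE TABLE xmllint_errors (
        id INTEGER PRIMARY KEY,
        fileid INTEGER,
        errorlevel TEXT
    );
    INSERT INTO file VALUES (1, 100), (2, 250), (3, 400);
    INSERT INTO xmllint_errors VALUES
        (1, 2, 'warning'),
        (2, 3, 'fatal error'),
        (3, 3, 'fatal error');

    -- Materialise the expensive result once (assumed derivation).
    CREATE TABLE convenience_fileswithfatalerror AS
        SELECT DISTINCT fileid
        FROM xmllint_errors
        WHERE errorlevel = 'fatal error';
    """
)

fatal_ids = [
    row[0]
    for row in conn.execute("SELECT fileid FROM convenience_fileswithfatalerror")
]

# Use the convenience relation as a filter for well-formed files.
largest_wellformed = conn.execute(
    """
    SELECT max(calculatedfilesize)
    FROM file
    WHERE id NOT IN (SELECT fileid FROM convenience_fileswithfatalerror)
    """
).fetchone()[0]

print(fatal_ids, largest_wellformed)
```

In the toy data only file 3 has a fatal error, so the largest well-formed file is file 2 with 250 bytes.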
This relation is available in version 1.1 and onwards of the UvA XML Web XML Metadata Database.
This relation contains the ids of all XML files that validate against their referenced DTD. It is supplied because the original query is computationally expensive.
| Field | Description |
| fileid (int, key) | Unique identifier of XML file. |
| schemaid (int, key) | Unique identifier of DTD schema file that is referenced by the XML file. |
This relation is available in version 1.1 and onwards of the UvA XML Web XML Metadata Database.
This relation contains the ids of all XML files that validate against their referenced XSD. It is supplied because the original query is computationally expensive.
| Field | Description |
| id (int, key) | Unique identifier of XML file. |
| schemaid (int, key) | Unique identifier of XSD schema file that is referenced by the XML file. |
The Referenced Schemas Collection contains the DTD and XSD schemas that are referenced in the documents of the XML Web XML Collection.
Note that schemas that are referenced in the XML Web XML Collection but could not be downloaded are not included.
All DTDs can be found in /dtd/<id>.dtd.
All includes in the retrieved DTDs are also retrieved recursively. They are available in a folder with the name of the DTD that includes them: /dtd/<id>/include1.dtd. All includes in the DTD files have been modified automatically to point to the locally downloaded versions.
The relation 'schema_dtd' in the Metadata Database contains the mapping between the locally downloaded DTDs in the XML Web Referenced Schemas Collection and the XML files in the XML Web XML Collection that reference them.
The schema_dtd relation contains the references to DTDs that are found in the XML Web XML Collection. It maps each reference to the locally downloaded DTD in the XML Web Referenced Schemas Collection.
| Field | Description |
| id (int, key) | ID of the tuple. |
| schemaid (int, foreign key) | The id of the DTD schema. It corresponds with the id in the filename of the file in the referenced schema collection: /dtd/<id>.dtd |
| fileid (int, foreign key) | This is the unique identifier of a file in the XML Web XML collection, that contains a reference to the DTD specified in 'schemaid' |
| reconstructedurl (text) | The URL from which the DTD was retrieved. It might be reconstructed from a relative url. |
| compiles (enum(true,false)) | TRUE if the schema can be compiled (contains no errors) |
All XSDs can be found in /xsd/<id>.xsd.
All other XSDs and other files that are referenced from inside the retrieved XSDs are also retrieved recursively. They are available in a folder with the name of the XSD: /xsd/<id>/include1.xsd.
The relation 'schema_xsd' in the Metadata Database contains the mapping between the locally downloaded XSDs in the XML Web Referenced Schemas Collection and the XML files in the XML Web XML Collection that reference them. All includes in the XSD files are altered to point to the locally downloaded version.
The schema_xsd relation contains the references to XSDs that are found in the XML Web XML Collection. It maps each reference to the locally downloaded XSD in the XML Web Referenced Schemas Collection.
| Field | Description |
| id (int, key) | ID of the tuple. |
| schemaid (int, foreign key) | The id of the XSD schema. It corresponds with the id in the filename of the file in the referenced schema collection: /xsd/<id>.xsd |
| fileid (int, foreign key) | This is the unique identifier of a file in the XML Web XML collection, that contains a reference to the XSD specified in 'schemaid' |
| reconstructedurl (text) | The URL from which the XSD was retrieved. It might be reconstructed from a relative url. |
| compiles (enum(true,false)) | TRUE if the schema can be compiled (contains no errors) |
From how many different web sites are the files in the collection downloaded?
SELECT count(distinct basedomain)
FROM `file`
WHERE id NOT IN (SELECT distinct duplicate
FROM duplicate_def5);
96,650 different web sites
What percentage of documents in the collection specify an encoding in the Content-Type HTTP-header?
SELECT count(*) as count,
(count(*) / (SELECT count(*) FROM file) *100) as percentage
FROM encoding
WHERE headerencoding IS NOT NULL;
+-------+------------+
| count | percentage |
+-------+------------+
| 42669 |    23.6210 |
+-------+------------+
1 row in set (0.01 sec)
What percentage of DTDs in the collection can be compiled?
SELECT compilestrue,
( compilestrue / ( compilestrue + compilesfalse) ) as percentagetrue,
compilesfalse,
( compilesfalse / ( compilestrue + compilesfalse ) ) as percentagefalse
FROM (SELECT count(distinct schemaid) as compilestrue
FROM schema_dtd
WHERE compiles='true') as t1,
(SELECT count(distinct schemaid) as compilesfalse
FROM schema_dtd
WHERE compiles='false') as t2;
+--------------+----------------+---------------+-----------------+
| compilestrue | percentagetrue | compilesfalse | percentagefalse |
+--------------+----------------+---------------+-----------------+
|          909 |         0.6611 |           466 |          0.3389 |
+--------------+----------------+---------------+-----------------+
1 row in set (0.05 sec)
66.1% of DTDs can be compiled.
What is the filesize of the largest file that is well-formed?
SELECT max(calculatedfilesize)
FROM file
WHERE id NOT IN (SELECT fileid
FROM convenience_fileswithfatalerror);
716853016 bytes
SPSS was used for correlation and regression analysis. The values were extracted from the XML Web XML Metadata Database.
This file has data on all files that contain a fatal error during well-formedness parsing. It has (1) the id of the file, (2) the number of fatal errors that were found in the file, (3) the calculated file size of the file.
This file has data on all files in the collection. It has (1) the id of the file, (2) the calculated file size of the file, (3) whether or not it contains a fatal error (0=false, 1=true).
This file has data on all files that reference a DTD. It has (1) the id of the file, (2) whether or not the DTD that is referenced can be compiled (0=false, 1=true), (3) whether or not the file contains a fatal error (0=false, 1=true).
This file has data on all files in the collection. It has (1) the id of the file, (2) the domain name extension (UNKNOWN if not available), (3) the filename extension (UNKNOWN if not available), (4) whether or not the file validates with the schema it references (0=false, 1=true), (5+) the value of all HTTP 1.1 headers (UNKNOWN if not available).