European Research Papers Archive


GUIDELINES FOR SITES AND PAPERS

Version: 5 November 2003
important changes compared to last version in green


Introduction

The ERPA software splits the management of papers included in the series between remote clients (Cologne, Florence, Harvard, Vienna etc.) and a central serving point (Vienna). The central serving point needs only a little information about the remote clients (WWW-sites). The rules below were designed to be as few and as flexible as possible.

For the spiders and the search-engine to function properly, however, it is essential that all sites and papers follow the guidelines below. On this basis, the clients may add papers to their sites – just as they usually do – and the central serving point will be able to include these papers and their specifics (e.g. author, title, ...) automatically. Users querying the search-engine will be searching this information only.

The structure of this text is as follows: first, a short overview of how ERPA works is given. Second, the general requirements for sites are described, and third, the requirements for the papers are defined. The latter are also illustrated with examples. Please note that there is a glossary at the end of this text; terms in the text which appear in this glossary are hyperlinked to the glossary section.


The Basics

The basic structure of ERPA is as follows: there is one spider for each participating site which knows its URL, the exact address of the directories etc. (see the general requirements for sites). All papers to be included in the ERPA search-base have to conform to certain requirements for papers. Once a day (probably during the night in Europe), the spiders will look at each site for new or changed papers.

  • Note that the spider checks the modification date of the so-called main file (see also below). To make the spider re-load a paper, the main file has to be re-saved, i.e. given a new date.

The spider then stores the data of all papers (i.e. the name of the author, the title etc. and also the full text) in two files per paper at the central serving point. The search form at the ERPA homepage (http://eiop.or.at/erpa/) is generated by the search-engine. This software then searches in the pairs of files for all papers (and not in the remote files at each site).
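
The daily check described above can be pictured with a small sketch. It is only an illustration and not the actual ERPA spider (which is a PERL script): the function name papers_to_reload and the two dictionaries mapping main files to modification timestamps are assumptions made for this example.

```python
def papers_to_reload(stored, current):
    """Return the main files that are new or whose modification date changed.

    stored  -- hypothetical mapping {main file: timestamp seen at the last visit}
    current -- hypothetical mapping {main file: timestamp seen today}
    """
    return sorted(
        name for name, stamp in current.items()
        if stored.get(name) != stamp  # new paper, or main file re-saved
    )


stored = {"1996-001a.htm": 100.0, "1997-001a.htm": 200.0}
current = {"1996-001a.htm": 100.0, "1997-001a.htm": 250.0, "1998-001a.htm": 300.0}
print(papers_to_reload(stored, current))  # ['1997-001a.htm', '1998-001a.htm']
```

In this sketch a paper whose other files changed, but whose main file kept its old date, would not be picked up, which mirrors the note above about re-saving the main file.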


General requirements for sites

To find the papers intended for ERPA (and only those), the spider needs three types of information concerning each site:

  1. First, a list of directories containing the selected papers (the reason for having several directories may be e.g. to have more than one series, or to have a separation between the volumes/years). These directories have to be URLs which must be reachable by WWW and whose content must not be hidden by an index file or otherwise. Note: If parts of papers are dispersed in several directories, the path relative to the main file has to be given in the INCLUDE tag (see below).
    • Example: for the EIoP, there is only one directory to be noted here at the moment:
      http://eiop.or.at/eiop/texte
    • Note that it is not possible to include parts of a paper which are not stored in one of the directories of the participating site. Therefore, if such a "remote" paper is to be included in the ERPA database, a main file (see just below) has to be created for this paper and stored in the appropriate directory of the site. [In this case, the URL of the remote paper would be indicated in the URL-tag – see below.] Note, however, that a remote paper included via such a special main file cannot be covered by the full text search, only by the author-title-date-keyword search.
  2. Second, a list of search-patterns for the file names in each of those directories. If the paper consists of more than one file, the search-pattern must indicate only one file per paper (called the main file hereafter).
    • Example: in the case of the EIoP, the main files are the abstract files, i.e. those files in the directory given above which end with an "...a.htm", e.g. 1996-001a.htm; by contrast, all other files without "a" just before the dot and the file extension are other parts of a paper and NOT the main file.
      • When adapting old file structures for the purpose of ERPA, it might be a good alternative – instead of altering the whole structure and names – to create a separate directory containing solely the main files, which point at the files containing the papers. In this case all files in the directory may be treated as main files. These main files need not even contain any formatted and visible data, but only META-tags (see below).
  3. Third, the email address of one person responsible at each site (s/he will be informed automatically by the spider or the central administrator if something goes wrong, e.g. if a wrong keyword was included, see below).

The information above (cf. points 1.–3.) has to be transmitted to the central administrator, who will then include it in the spider's configuration. It is not difficult to change this information if necessary.
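
As a rough illustration of the three pieces of site information, the following sketch stores them in one record and uses the search-pattern to pick the main files out of a directory listing. The class and field names, the glob-style pattern syntax and the placeholder email address are assumptions for this example; the guidelines do not prescribe how the spider's configuration is actually written.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class SiteConfig:
    """Hypothetical per-site record; the field names are assumptions."""
    directories: list   # point 1: URLs of the directories holding the papers
    patterns: list      # point 2: file-name patterns identifying the main files
    contact_email: str  # point 3: person responsible at the site


def select_main_files(config, file_names):
    """Keep only the file names matching one of the site's search-patterns."""
    return [
        name for name in file_names
        if any(fnmatchcase(name, pattern) for pattern in config.patterns)
    ]


eiop = SiteConfig(
    directories=["http://eiop.or.at/eiop/texte"],
    patterns=["*a.htm"],                  # EIoP: the abstract files are the main files
    contact_email="contact@example.org",  # placeholder address
)
listing = ["1996-001a.htm", "1996-001.htm", "1996-001t.htm", "1997-001a.htm"]
print(select_main_files(eiop, listing))  # ['1996-001a.htm', '1997-001a.htm']
```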


Requirements for papers

Papers included in ERPA have to be either in HTML or in PDF format. The following rules apply:

A series which offers its papers in PDF format only has two options:

  1. The standard method is to make the papers available in HTML format AS WELL. This may be done in a separate directory, dedicated to ERPA only. It is not necessary to put any effort into the layout of these HTML versions of the PDF papers, since they only serve the purpose of delivering the information necessary for ERPA. With the URL-field (see below), it is possible to search in the HTML version of the paper, but to point at the PDF version as the result of a search.
  2. The new option is that the INCLUDE-field (see below) points at one or several PDF files. The disadvantage of this option as compared to option 1 above is that it is not possible to use the TEXT-tags (see below) to delimit the full text. Instead all text extracted from the PDF file will be considered the full text of this paper.

Each paper must consist of one ERPA main file (in HTML format) and zero or several other files. The main file includes the information necessary for ERPA and also guides the spider to the possible other parts (files) of the paper.

The spider searches for the following list of fields to extract information from the main file. The fields are marked up in the HTML files according to the rules below. Note that some of them (author and title) are compulsory, the date is highly recommended, and the others are optional:

  1. author:
    the name of the author(s) of the paper
    • compulsory field
    • there is no convention as to the format of the authors (word order, abbreviations, separation of authors etc.); see the example below
      However, we recommend being consistent throughout the whole series. That is, stick to one system: with or without abbreviation, same word order, same word or sign before last author etc.
      Names should not be in capital letters only.
  2. title:
    the title of the paper
    • compulsory field
    • Please be aware of the special rules on quotation marks (below)
  3. date:
    date of publication of the paper
    • This field is not compulsory, though it is highly recommended to indicate the publication date in order to make the date search possible. If this field is not present, the search-engine will not find the respective paper in the date search mode.
    • This field has to be given as follows: D.M.YYYY (D: day, M: month, Y: year), separated by dots. Always use numbers, without padding blanks or leading zeros.
    • e.g. 1.3.1998 for 1 March 1998
    • It is, however, possible to give the publication date not as a precise date with day and month but in the form YYYY or M.YYYY. In this case, the spider will automatically expand the date to 1.1.YYYY or 1.M.YYYY.
  4. URL:
    the URL which the search-engine should point to as the result of a search
    • This field can be used to let the search-engine point to another file than the main file (maybe a PDF-file in a download area or a link with an anchor to a page showing a list of papers).
      • Note that the URL given in this field may also be a remote URL, i.e. not on the server of the site.
    • If this field is not present, the location of the main file itself will be used as the relevant URL.
  5. keywords:
    the list of keywords attributed to the paper
    • If this field is not present, no keywords will be defined and hence be searchable.
    • This field is a list of words or groups of words, separated by commas. A group of words must not contain a comma.
    • Only keywords from the keyword-list of the central serving point are allowed. If a new keyword should be added, the central administrator has to be informed.
      (If, however, an unlisted keyword is included in the main file, the responsible person at the respective site will be informed by email; as long as a new keyword is not included in the list of keywords, it is effectively non-existent, which means that it cannot be found via the search-engine.)
  6. text:
    the main text of the paper
    • If this field is not present, the spider will take the whole file as text. However, as soon as it encounters a BEGIN TEXT tag, only what is found between these tags is considered.
      NOTE: If the BEGIN TEXT tag has been used once in one file, e.g. the main file or the file containing the first part, the spider will not find any text other than that between the TEXT tags. In other words: tagging the full text explicitly in one file cannot be combined with leaving the text untagged in another file.
    • This field may be repeated (1) in several files or (2) within one single file, i.e. there might be more than one pair of BEGIN and END tags for one paper (for the markup see below).
    • as to option (1): a paper may consist of several parts, e.g. in order to minimise the time for downloading
    • as to option (2): a one-file-only paper may be marked up in a way to distinguish between, on the one hand, the full text of the paper and the abstract (all marked up as "text"), and, on the other hand, tables, graphs, references, etc. (which will not be marked up as "text").
    • If the spiders encounter only a BEGIN TEXT tag but no END TEXT, the whole text from the BEGIN TEXT tag until the end of the file is considered the text.
    • Note that the search engine uses stopwords.
    • Note that this tag is not available if the INCLUDE tag points to a PDF file (see below).
  7. include:
    a list of one or more files which also include(s) (parts of) the main text
    • If the main text is not given in the main file or is dispersed over more than one file, the include-field must be used to guide the central serving point to all files belonging to the paper.
    • This field is a list of file names separated by commas. In each of these files the main text should be enclosed in COMMENT-tags (see just below).
    • If these files do not reside in the same directory as the main file, the relative (not the absolute) path has to be given in the INCLUDE list for each file (relative to the URL of the main file).
      Example: The main file is at the following URL: http://www.xxx.yyy/series/abstracts/ and the two text files are here: http://www.xxx.yyy/series/texts/. Then the entry in the INCLUDE tag would be:
      <META NAME="include" CONTENT="../texts/papertext1.htm, ../texts/papertext2.htm">.
    • The INCLUDE tag may point to HTML or PDF files. If a series offers both HTML and PDF versions of a paper, please point to the HTML version in the INCLUDE tag; in any case, do not point to both versions, as this would mean unnecessary doubling of the full text.
  8. abstract:
    designates the text of the English abstract of the paper.
    • Be careful not to nest the TEXT and ABSTRACT comment tags! While it is certainly possible to have both ABSTRACT and TEXT tags for one paper, the two types of begin and end tags must not be interleaved: after a <!--BEGIN abstract--> tag there must not be a <!--BEGIN text--> tag until the <!--END abstract--> tag has been given.
    • It is allowed to insert the following line break tags inside the abstract text: <BR> or <br> (by contrast, the <P> tag as well as any other tag will be deleted by the spider when uploading the paper).
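
The date rules in point 3 above can be sketched as a small normalisation routine. This is an illustration under assumptions, not the spider's own code: the function name normalise_date is invented here, and the sketch is more lenient than the guidelines in that it silently drops leading zeros instead of rejecting them.

```python
import re

# Optional day, optional month, compulsory four-digit year, separated by dots.
DATE_RE = re.compile(r"^(?:(?:(\d{1,2})\.)?(\d{1,2})\.)?(\d{4})$")


def normalise_date(value):
    """Expand YYYY or M.YYYY to D.M.YYYY; return None for any other form."""
    match = DATE_RE.match(value.strip())
    if match is None:
        return None
    day, month, year = match.groups()
    day = int(day) if day else 1        # a missing day becomes 1
    month = int(month) if month else 1  # a missing month becomes 1
    if not (1 <= day <= 31 and 1 <= month <= 12):
        return None
    return f"{day}.{month}.{year}"


for value in ("1.3.1998", "3.1998", "1998"):
    print(value, "->", normalise_date(value))
# 1.3.1998 -> 1.3.1998
# 3.1998 -> 1.3.1998
# 1998 -> 1.1.1998
```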

The following rules apply to this list of fields:

  • The information outlined above (points 1.–8.) has to be included in
    • either META-tags
    • or between <!--BEGIN field--> and <!--END field--> COMMENT-tags.
    with the exception of the text-field which can only be given in the form of COMMENT-tags.
    • These two alternatives may be mixed; if, however, the same field is given twice, the last occurrence will be considered by the spider. The exception to this rule is the text-field, which may occur more than once (see above).
    • For URL and include, it seems most appropriate to use the META-tags. Since the rest of the information (author, title, etc.) is most likely already included in the main body of the main file, the second alternative (COMMENT-tags) seems easier to implement because it needs NO DUPLICATION of information in the HTML file.
  • Ideally, each META-tag or COMMENT-tag should stand at the beginning of a new line and should not be broken by line breaks, i.e. in between the start ("<") and the end (">") signs.
    However, the spider now accepts that META- or COMMENT-tags do not stand at the beginning of a line. Either blanks or any other characters are allowed between the beginning of the line and the beginning of the tag.
    Furthermore, the concluding bracket (">") of the META-tag need not stand on the same line as the beginning of the tag. However, the first part of the META-tag, i.e. everything from the initial bracket ("<") to the last letter of NAME, has to be on one line; in other words: no line break within the string "<META NAME". Everything from the sign "=" to the concluding bracket (">") can be on the next line (or several following lines). This makes it much easier to have long lists of keywords or long titles, which tend to be broken into more than one line.
    Note that, in contrast to META-tags, COMMENT-tags have to stand in one line, from the initial to the concluding bracket.
    Furthermore:
    • Avoid including HTML markup in between the COMMENT-tags. The spider will delete all markup, but this is not absolutely perfect (although it has been refined); it is therefore better to put the COMMENT-tags immediately before and after the content. In particular:
    • Avoid including asterisks or first footnotes (which refer to, e.g., the author's affiliation) or other text (e.g. copyright notices) in between the COMMENT-tags.
    • All authors' names have to be in lower case (except for the first letter of each word); otherwise the name would not be found when the option "case sensitive" is turned on. The same is recommended for titles.
    • Consequently, it is recommended to use META-tags instead of COMMENT-tags if the formatted paper includes footnotes, copyright notices or upper case words. This means that you would have to repeat e.g. the authors' names in the META-tag, but this improves the performance of the search-engine considerably.
  • Text between the obligatory quotation marks (") in META-tags and between COMMENT-tags is case-sensitive.
  • Text between quotation marks in META-tags must not contain (the same type of) quotation marks, i.e. no " (two strokes), but ' (one stroke) or « or ».
  • No quotation marks are needed between COMMENT-tags; they are, however, allowed. Note: any quotation marks between COMMENT-tags form part of the information included in the field!
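
To make the interplay of the two markup alternatives more concrete, here is a minimal sketch of how fields could be read from a main file, with the last occurrence of a field winning. It is not the spider's actual PERL code: the function name, the regular expressions and the sample names are assumptions, and the sketch expects the tags to be written exactly as in these guidelines (upper-case META NAME and CONTENT, COMMENT-tags without inner blanks).

```python
import re

# A META-tag may continue on following lines after "<META NAME".
META_RE = re.compile(
    r'<META NAME\s*=\s*"([^"]+)"\s+CONTENT\s*=\s*"([^"]*)"\s*>', re.DOTALL
)
# Each COMMENT-tag stands on one line; the content between a pair may span lines.
COMMENT_RE = re.compile(r"<!--BEGIN (\w+)-->(.*?)<!--END \1-->", re.DOTALL)


def extract_fields(html):
    """Collect fields from META- and COMMENT-tags; the last occurrence wins."""
    found = []  # (position in the file, field name, raw value)
    for m in META_RE.finditer(html):
        found.append((m.start(), m.group(1), m.group(2)))
    for m in COMMENT_RE.finditer(html):
        if m.group(1) != "text":  # the text field may repeat; not handled here
            found.append((m.start(), m.group(1), m.group(2)))
    fields = {}
    for _, name, value in sorted(found):
        fields[name] = " ".join(value.split())  # collapse line breaks and blanks
    return fields


sample = """<HEAD>
<META NAME="author" CONTENT="A. Author">
<META NAME="keywords" CONTENT="institutions,
enlargement">
</HEAD>
<BODY>
<!--BEGIN title-->
A sample title
<!--END title-->
<!--BEGIN author-->
Anna Author and Ben Writer
<!--END author-->
</BODY>"""
fields = extract_fields(sample)
print(fields["author"])  # Anna Author and Ben Writer
```

The author field appears twice in the sample; the COMMENT-tag version comes later in the file and therefore replaces the META-tag version.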

Note that a number of special characters are allowed: see the respective list here.


Examples for ERPA markup in the papers


Alternative 1: META-tags

Download the template.htm for a solution
involving only meta tags.

<HEAD>
<META NAME="author" CONTENT="Philippe C. Schmitter and
José I. Torreblanca">
<META NAME="title" CONTENT="Old 'foundations' and new 'rules'
 - For an enlarged European Union">
<META NAME="keywords" CONTENT="institutions, enlargement,
majority voting, Council of Ministers, European Parliament">
<META NAME="date" CONTENT="10.4.1997">
<META NAME="URL" CONTENT="http://eiop.or.at/eiop/texte/1997-001a.htm">
<META NAME="include" CONTENT="1997-001.htm,
1997-001b.htm, 1997-001c.htm">
<META NAME="abstract" CONTENT="This contribution begins with reciting
 the facts behind the resignation of the European Commission under
 Jacques Santer, followed by theoretical considerations on the
 significance of trust and reputation from the principal-agent-theory
 perspective. The third part puts the emphasis on discussing as to
 which extent a loss of trust and reputation had an influence in the
 resignation of the Santer-Commission.<BR>The author concludes that the
 Santer-Commission underestimated the increased power of the European
 Parliament. The inadequate information policy and the increasing
 practice of manipulating documents led to a loss of trust.
 After the threshold had been crossed in connection with the
 BSE-scandal further violations finally led to the destruction
 of reputation of the Santer-Commission.">
</HEAD>

The files 1997-001.htm, 1997-001b.htm, 1997-001c.htm are all stored in the same directory as the main file and include the main text of the paper in three parts; in each of these files the beginning and the end of the text is marked up with the COMMENT-tag for the field "text".
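
How the entries of an INCLUDE list relate to the location of the main file can be sketched with Python's standard urljoin function. The helper name resolve_includes is an assumption for this example, and the file name paper.htm in the second call is invented, since the guidelines only give the directory of that main file.

```python
from urllib.parse import urljoin


def resolve_includes(main_file_url, include_content):
    """Turn the comma-separated INCLUDE list into absolute URLs."""
    names = [part.strip() for part in include_content.split(",")]
    return [urljoin(main_file_url, name) for name in names if name]


# Files in the same directory as the main file (the example above).
print(resolve_includes(
    "http://eiop.or.at/eiop/texte/1997-001a.htm",
    "1997-001.htm,\n1997-001b.htm, 1997-001c.htm",
))

# Files in another directory, given relative to the main file.
print(resolve_includes(
    "http://www.xxx.yyy/series/abstracts/paper.htm",  # hypothetical file name
    "../texts/papertext1.htm, ../texts/papertext2.htm",
))
```

The first call yields three URLs under http://eiop.or.at/eiop/texte/, the second two URLs under http://www.xxx.yyy/series/texts/.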


Alternative 2: COMMENT-tags

<BODY>
Publication date:
<!--BEGIN date-->
10.4.1997
<!--END date-->
Some text. This one is written by
<!--BEGIN author-->
Philippe C. Schmitter and José I. Torreblanca
<!--END author-->
and the title is
<!--BEGIN title-->
Old 'foundations' and new 'rules' - For an enlarged European Union
<!--END title-->
<!--BEGIN abstract-->
Text of the abstract, several lines, preferably one <BR> or two paragraphs.
<!--END abstract-->
Some introduction not directly belonging to the paper.
<!--BEGIN text-->
The interesting part of the paper: the full text...
<!--END text-->
Maybe some references.
</BODY>

In this example, no keywords are given; the file includes all parts of the text; and the search-engine will point to this file only.
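
The rules for the text field (several BEGIN/END pairs, a missing END tag, no tags at all) can be illustrated for a single file as follows. This is a sketch only: the function name is an assumption, the tags are matched exactly as written in these guidelines, and the rule that tagged and untagged files of one paper cannot be combined is not modelled here.

```python
BEGIN, END = "<!--BEGIN text-->", "<!--END text-->"


def extract_text(html):
    """Return the searchable text of one file according to the TEXT-tag rules."""
    if BEGIN not in html:
        return html  # no TEXT tags: the whole file counts as text
    pieces = []
    pos = html.find(BEGIN)
    while pos != -1:
        start = pos + len(BEGIN)
        end = html.find(END, start)
        if end == -1:
            pieces.append(html[start:])  # no END tag: take all up to the end of file
            break
        pieces.append(html[start:end])
        pos = html.find(BEGIN, end + len(END))
    return " ".join(piece.strip() for piece in pieces)


page = "intro <!--BEGIN text-->Part one<!--END text--> table <!--BEGIN text-->Part two"
print(extract_text(page))  # Part one Part two
```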


Alternative 3: mixed solution

Example taken from a test version of EIoP paper no. 1997-001: the paper consists of three files: 1997-001a.htm which includes the main information necessary for the search engine; 1997-001.htm which is the file with the main text; 1997-001t.htm which includes the tables and which will not be included in the full text search. The following code is extracted from the 1997-001a.htm file (dots [...] indicate deleted further markup):

<HEAD>
<TITLE>EIoP: Text 1997-001: Abstract</TITLE>
<META NAME="include" CONTENT="1997-001.htm">
<META NAME="URL"
CONTENT="http://eiop.or.at/eiop/texte/1997-001.htm">
</HEAD>
...
<!--BEGIN title-->
Old 'foundations' and new 'rules' - For an enlarged European
Union
<!--END title-->
...
<!--BEGIN author-->
Philippe C. Schmitter and José I. Torreblanca
<!--END author-->
...
Date of publication in the EIoP
<!--BEGIN date-->
10.4.1997
<!--END date-->
<!--BEGIN keywords-->
institutions, enlargement, majority voting, Council of Ministers,
European Parliament
<!--END keywords-->

<!--BEGIN abstract-->
Text of the abstract, several lines, preferably one <BR> or two paragraphs.
<!--END abstract--> ...

In the file 1997-001.htm, the tags <!--BEGIN text--> and <!--END text--> are placed right before the first heading and just before the beginning of the references part (to avoid that words in the titles of cited literature may be found in a full text search).
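
A keywords field such as the one in the example above has to be split at the commas and compared with the central keyword-list. The following sketch shows one way this could be done; the function name, the made-up excerpt of the keyword-list and the unlisted keyword "fuzzy logic" are assumptions for illustration only. The comparison is case-sensitive, in line with the rule on case-sensitivity given earlier.

```python
def check_keywords(content, allowed):
    """Split a keywords field and separate listed from unlisted keywords."""
    keywords = [" ".join(part.split()) for part in content.split(",")]
    keywords = [keyword for keyword in keywords if keyword]
    valid = [keyword for keyword in keywords if keyword in allowed]
    unlisted = [keyword for keyword in keywords if keyword not in allowed]
    return valid, unlisted


# Made-up excerpt of the keyword-list, not the real list.
allowed = {"institutions", "enlargement", "majority voting"}
valid, unlisted = check_keywords(
    "institutions, enlargement, majority voting,\nfuzzy logic", allowed
)
print(valid)     # ['institutions', 'enlargement', 'majority voting']
print(unlisted)  # ['fuzzy logic']
```

In such a scheme, the unlisted keywords would be the ones reported by email to the person responsible at the site.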


Glossary

central administrator
the person responsible for maintaining the spiders and the search-engine, the keywords list, and the ERPA-WWW-site; at the moment, this is Michael Nentwich, mnent@oeaw.ac.at
central serving point
short for the ERPA-WWW-site with the search-form, the spiders, the search-engine, and the central disc space for storing the information necessary for the search-engine
field
a number of characters enclosed in either META- or COMMENT-tags which give a certain type of information such as the name of the author
keyword-list
a list of keywords maintained at the central serving point by the central administrator which includes all valid keywords; this list will show in the ERPA search form and is also available online at: http://eiop.or.at/erpa/cgi/ERPAkeywords.txt.
main file
the file containing the searchable information (author, title, etc.); it may, but need not, contain the main text of the paper or parts of it
parts of paper
a paper included in ERPA may consist of various parts, one of which is the main file, the others containing parts of the full text of the paper
search-engine
a PERL script installed at the central serving point which is executed via the WWW search form of ERPA
sites
the (remote) WWW servers where the participating working paper series are located
spider
a PERL script, executed periodically at the central serving point, designed to gather the information required for the search-engine; the programme has separate configuration files for each site
stopwords
a list of words which are so common that it would not make sense to include them in the full text search (e.g. "and", "or", "the" etc.); the spider filters the full text in order to exclude all stopwords from the database; the current list of stopwords used in ERPA is at http://eiop.or.at/erpa/cgi/stopwords.txt

©1998-2003 ERPA network; designed by MN