High Energy Physics Libraries Webzine
Issue 11 / August 2005
Gallica is a virtual library which provides free access to approximately 75,000 volumes in all fields of knowledge, from antiquity to the beginning of the twentieth century. Readers have direct access to the texts from home, without needing to handle volumes that are weakened by time and sometimes unavailable in their local libraries. The Gallica library is a great success, with a million connections per month. Access to classic titles like the Procès Verbaux de l'Académie des Sciences or the Journal des Scavans is no longer difficult or reserved for those who have the time. It is now possible for researchers to access and read Einstein's articles on the theory of relativity in Annalen der Physik from 1905.
The idea for the large-scale digitizing of France's cultural heritage was born after the creation of the Très Grande Bibliothèque was announced at the end of the 1980s. Initially, the objective was to give users of the future Etablissement Public de la Bibliothèque de France (EPBF) access to around 300,000 reference titles via reading room workstations (Poste de Lecture Assistée par Ordinateur). For budgetary reasons this was reduced to 100,000 titles.
Digitization tests began in 1990. The technology choices had to allow as wide a margin as possible for adaptation and use in a rapidly evolving computing environment, while also conforming to the requirements of a semi-industrial production method. It is out of the question for the BnF to continuously acquire new, improved digitization equipment and to employ a member of staff to continuously make improvements to the scanned files. The quantities to be digitized are huge and so two people had to be employed.
The first test demonstrated that it was not possible to digitize a bound book page by page at a reasonable cost without risk of damaging it through handling. Moreover, the lengthy procedure involved in the industrial-scale digitization of bound volumes made it impossible to meet the deadlines set at the beginning of the project.
It was therefore decided to digitize either guillotined copies of books (purchased for the purpose) or microforms.
As books constituting part of the national heritage are only very rarely illustrated, it was decided to digitize them as black and white TIFF files, in view of this format's widespread use. CCITT Group 4 compression made it possible to provide users with files that could easily be transferred, without loss of quality.
In 1994, the EPBF was merged with the Bibliothèque Nationale (BN) to become the Bibliothèque Nationale de France (BnF), the French National Library.
In 1997, the possibility of making the digitized titles available on the Internet was envisaged. The main advantages of this medium are that it enables several readers to consult historical, sometimes rare, works simultaneously, regardless of their location and without having to travel to the Bibliothèque Nationale de France, and that it preserves the collection from excessive handling. As TIFF files cannot be consulted on the Internet, the PDF format, which is better suited for the purpose, was proposed.
The need to digitize documents in colour rapidly became apparent. The much larger size of the files generated by colour digitization, in spite of JPEG compression, posed a problem. Furthermore, the industrial process for black and white digitization initially did not allow a small number of colour pages to be included. The colour images were therefore digitized by a different service provider and stored on servers separate from those used for the black and white files. Only Gallica had access to both sets of servers. However, this meant that a digitized volume had to be consulted in separate files: the whole volume in its black and white version, plus its colour illustrations. Since 2003, the service providers have been digitizing the pages irrespective of whether they are black and white or colour. In 2004, the computing system was upgraded to allow readers to consult files containing both the black and white and colour pages.
As soon as the idea emerged of making the relevant titles available online via the Internet, the use of the new medium for the dissemination of information re-opened the discussions on legal problems associated with intellectual property laws. Since 1996, the establishment has observed European law relating to intellectual property rights, and since that date only titles that are out-of-copyright have been digitized. Broadly speaking, a work is deemed to be out-of-copyright seventy years after the author's death. The Bourbaki works of this century are thus excluded. But those works of Marie Curie (who died in 1934) that were not prefaced by her daughter Irène, such as Recherches sur les Substances Radioactives, written in 1904, have just been released, and we have therefore been able to digitize them.
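The seventy-year rule of thumb can be written as a one-line check. This is only a minimal sketch: the function name is invented for illustration, and it assumes that protection runs to the end of the seventieth calendar year after the author's death. It ignores joint works, prefaces by other authors and any national extensions, all of which a real rights assessment would have to consider.

```python
def is_out_of_copyright(death_year: int, current_year: int, term_years: int = 70) -> bool:
    """Rule-of-thumb check (assumption: the term runs to the end of the
    calendar year `death_year + term_years`, so the work becomes free
    the following year)."""
    return current_year > death_year + term_years


# Marie Curie died in 1934, so under this simplified rule her own works
# become eligible for digitization from 2005 onwards.
print(is_out_of_copyright(1934, 2004))  # False
print(is_out_of_copyright(1934, 2005))  # True
```

Under this assumption the result matches the article's remark that Marie Curie's works had only just become available in 2005.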
Following the launch of a Website, a design for the structure of the homepage also had to be developed that was appropriate for the site's target audience. Although an audience consisting of researchers in fields of science and history had a significant influence on the titles selected for inclusion in the corpus, in recent years there has been a clear re-orientation towards what can be called a "broader audience". While the great authors are of interest to students, secondary school pupils and teachers, the numerous works aimed at the popularisation of science and technology are destined for more inquisitive readers.
The homepage's interface allows several modes of access to the digitized titles, which reflect the diversity of the site's target audience. The search tool gives access to a catalogue of the titles that can be consulted. It is aimed at those who have already narrowed their search down to a particular author, title or subject. The results of a full-text search provide full bibliographical information as well as access to any tables of contents that might be available. Via this full-text search one can, for example, go directly to Jérôme Lalande's article "Observation de l'éclipse de lune du 27 Mars 1755 faite au Luxembourg à Paris", published in Histoire de l'Académie Royale des Sciences avec les Mémoires de Mathématique et de Physique, drawn from the register of this Academy in 1755.
In addition to a list of periodicals, the homepage interface also provides direct access to the list of dictionaries available through the Website. Given that dictionaries and periodicals are the most difficult documents to consult in picture format from a workstation, but paradoxically are those most in demand, this mode of access enables regular readers to keep abreast of the latest works accessible online in those categories. A list of the latest works available is posted on the Website each month for "regulars".
The homepage's "Discovery" interface enables users to navigate the Website by theme, with a corpus consisting of introductory texts on specific scientific topics for each broad period of history. For the seventeenth century, for example, it seems appropriate to acquaint the reader with the works of Copernicus, Bruno, Kepler or Galileo by introducing them to the history of astronomy at the time.
The option to navigate by theme appears to be most commonly used by secondary school teachers and their pupils. The information available can be used to supplement teaching materials.
What does Gallica offer on science? What does the corpus consist of?
Physics and mathematics on Gallica
Having determined the "ideal corpus" in collaboration with researchers and using bibliographical references, it was necessary to assess the ease with which the selected titles could be procured and identify their availability on the book market. Among the first sources for the material were the antiquarian book market and purchases from microform publishers.
The direct digitization from documents purchased on the antiquarian market yields better results than from microforms. Moreover, the automation of the digitization process once the documents have been guillotined reduces the cost involved. However, the disadvantage is that purchases are restricted to titles printed during the nineteenth century, as the guillotining of old books printed before the nineteenth century is prohibited. The purchase of microforms is therefore complementary as it allows copies of very old books to be integrated into the corpus (particularly in the fields of botany and medicine and in the case of foreign collections such as the Royal Society's Philosophical Transactions). Unfortunately, the quality of the microforms and consequently the quality of the digitized copies are not always very satisfactory. Another disadvantage of the microforms is that they are all, with a few exceptions, in black and white. The reproduction process therefore results in a loss of quality for any colour plates or illustrations.
The second, and no less important, source of material is through the BnF's partnerships with other libraries. For example, it is through these partnerships that the BnF has been able to digitize the periodicals published by the Académie des Sciences. Twenty per cent of the scientific titles that have been made available online come from these partner libraries.
Since the BnF was established in 1994, it has also been able to draw on the collections of the BN which had been transferred onto microforms with the objective of conserving them or which were available in duplicate following the closure of the Versailles loan centre.
Today, the BnF has moved from the phase of assembling the corpus to that of completing it. This means that the main sources of material are the antiquarian book market (but to a lesser extent) and the BnF's own collections.
The scientific corpus represents around 17% of the Bibliothèque Nationale de France's digitized holdings (around 16,000 volumes) as well as 17% of the out-of-copyright material available for consultation on the Internet (12,000 volumes).
The collection also includes interdisciplinary titles that can be classed under the headings of philosophy, works of popularisation, scientific literature, rural economics, agronomics, etc.
The statistics show that, every year, the number of hits on the Website doubles during the months of May and June. The most likely cause of this peak is the university examination period.
For a more accurate assessment of the overall level of consultation of Gallica's scientific titles, it is therefore necessary to refer to "neutral" months such as November and December. User figures show that only an insignificant number of titles remain unconsulted during these two months and therefore that even the latest works available online are consulted, suggesting that Gallica readers keep abreast of the latest additions.
A closer look at the fields of physics and mathematics reveals that the two corpora offer a roughly similar number of documents: 873 volumes on mathematics (518 titles) and 905 volumes on physics (391 titles). The significantly lower number of physics titles can be explained by the number of periodical volumes which is three times higher than that for mathematics. Both corpora also differ in the sense that treatises and studies represent over 77% of the mathematics corpus compared to only 54% for the physics corpus. Nonetheless, both corpora include all the great authors from antiquity to the beginning of the twentieth century, as well as a sample of all the higher education teaching manuals (from the nineteenth century). Both corpora have been assembled in a similar way, mainly through purchases, with a few loans from partner libraries and from the Bibliothèque Nationale de France's own collections. Nevertheless, the number of titles proposed for digitization is far higher for mathematics than for physics. Is it not regularly stated that the ancient mathematics texts remain topical and that they form the basis of the physics of today? This would explain the significant increase in the number of mathematics titles available for consultation on Gallica in the last year. At the end of 2003, there were 130 more physics than mathematics volumes available on Gallica.
This high level of interest in mathematics texts is clearly visible from the user hit statistics for November and December 2004. In fact, only 13 of the mathematics volumes available were not consulted during those two months, compared to 33 for physics. Therefore, not only is the corpus used almost in its entirety, but on average each mathematics volume is consulted twice as often as a physics volume (the mathematics corpus received 18,236 hits on average per month compared to 9,268 for physics). Over these two months, a mathematics volume was consulted on average 42.5 times compared to 21 for a physics volume. It would be easy to conclude that this difference is due to the higher number of mathematics as opposed to physics treatises and that the general public is more attracted by major monographs than by other types of documents. However, that is not the case, since the periodical "Journal de Mathématiques Pures et Appliquées" alone is consulted as often as all the physics periodicals put together.
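The per-volume averages can be roughly reproduced from the figures quoted in the text. The sketch below assumes that the monthly hit counts are totals for each corpus and that the average is taken over the volumes actually consulted. The article does not state its exact method, so both points are assumptions, and the variable and function names are illustrative only.

```python
MONTHS = 2  # November and December 2004

# Figures quoted in the article; the dictionary layout is illustrative.
corpora = {
    "mathematics": {"volumes": 873, "unconsulted": 13, "hits_per_month": 18_236},
    "physics": {"volumes": 905, "unconsulted": 33, "hits_per_month": 9_268},
}


def hits_per_consulted_volume(volumes, unconsulted, hits_per_month, months=MONTHS):
    """Average number of hits per consulted volume over the whole period.

    Assumption: `hits_per_month` is a total for the corpus, and volumes
    that were never opened are left out of the denominator.
    """
    total_hits = hits_per_month * months
    consulted_volumes = volumes - unconsulted
    return total_hits / consulted_volumes


for name, figures in corpora.items():
    average = hits_per_consulted_volume(**figures)
    print(f"{name}: {average:.1f} hits per consulted volume")
```

Under these assumptions the script gives about 42.4 hits per mathematics volume and 21.3 per physics volume. That is close to the 42.5 and 21 quoted in the text, and the ratio is almost exactly two to one.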
Consequently, although the high number of titles available and the high level of consultation of mathematics titles can be explained in part by the permanent topicality of the texts and by suggestions from mathematics historians and mathematicians, this is only one factor. There is another.
Much of the success can doubtless be attributed to the work of the Mathdoc team, a joint service unit of the CNRS and UJF (Joseph Fourier University, in Grenoble), which has set up a Mathematics Document Portal. The portal groups together a large number of digitized documents from the field of mathematics, whether they come from Gallica, the University of Michigan Library or Cornell University Library. The research community is therefore aware that by consulting the portal they have access to almost the entire digitized mathematics corpus. In addition, the Mathdoc team has indexed all the periodical articles. Thus, one can search for an author or for the title of an article in the Journal de Mathématiques Pures et Appliquées and immediately view it in its entirety, without cumbersome searches through tables of contents and various Websites. The ease of access and the grouping of digitized mathematics documents thus significantly increase user levels for the titles on Gallica.
The link between the BnF and Mathdoc was formalized by an agreement in 2002. Gallica provides Mathdoc with the records of new mathematics publications placed online, and Mathdoc integrates these into its search engine. In addition, Mathdoc has indexed all the articles of the mathematical periodicals digitized by Gallica.
No equivalent arrangement exists in the field of physics. However, the same types of work are consulted in both mathematics and physics. Systematically, in both cases, the titles that are most often consulted (apart from the periodicals) are those of the great authors in the above-mentioned fields. For physics those heading the list are: Newton, Fourier, Poincaré, Duhem and Arago. For mathematics they are: Cauchy, Molk, Lagrange, Huygens, Bertrand and Abel. In the research field such works are consulted for their scientific content, the very content which has led their authors to be declared "great" at the great bar of history, whereas secondary school teachers and their pupils tend to consult these works because of their authors' celebrity. It is these texts that are the basis for the history of science and, being targeted at the widest possible audience, can therefore be navigated by theme.
Future technical and scientific developments
Today, the development of Gallica is focused on completing the existing corpus and on the modes of access to it. However, the creation of numerous digitized science corpora, and developments in the field of technology have prompted a redefinition of Gallica's format and its content.
A large number of institutions have launched projects to make texts available online. In doing so, they choose either to showcase their own documentary collections, to mark a particular science event or even, following the lead of the Mathdoc team, to set up portals compiling specialist documentation. Tools are therefore required to provide readers with a direct overview of the full digitized science corpus. The BnF is moving towards the OAI (Open Archives Initiative) solution. In due course, this will enable any library that is equipped with the protocol and wishes to supplement its corpus to harvest the bibliographical references for the titles available on Gallica and to refer the reader immediately to the digitized title, without duplicating the digitization work. The first bibliographical references for Gallica's simple monographs have now been provided using Dublin Core metadata, so that Gallica can become a source for the dissemination of data. The first trial is underway. Finally, in due course the BnF hopes to refer Gallica readers to the relevant titles digitized by other institutions and available on their own Websites. The BnF hopes to test the Gallica data harvester by the end of 2006.
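To make the harvesting idea concrete, the following sketch shows how a partner library might build an OAI-PMH ListRecords request for Dublin Core records and extract the fields needed to point a reader to a digitized title. It is an illustration only. The base URL, the identifiers and the sample response are invented, and nothing here describes Gallica's actual service. Only the verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical repository address, used for illustration only.
BASE_URL = "https://oai.example.org/oai"

# Namespaces defined by OAI-PMH 2.0 and simple Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request asking for Dublin Core records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Invented response of the kind a harvester would receive.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:record-1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Recherches sur les substances radioactives</dc:title>
          <dc:creator>Curie, Marie</dc:creator>
          <dc:identifier>https://example.org/document/record-1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Return title, creator and link for each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:  # e.g. a deleted record, which carries no metadata
            continue
        records.append({
            "oai_identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": dc.findtext("dc:title", namespaces=NS),
            "creator": dc.findtext("dc:creator", namespaces=NS),
            "link": dc.findtext("dc:identifier", namespaces=NS),
        })
    return records


print(list_records_url(BASE_URL))
for entry in parse_records(SAMPLE_RESPONSE):
    print(entry["creator"], "|", entry["title"], "|", entry["link"])
```

A real harvester would fetch the URL over HTTP and follow resumption tokens for large result sets. Both steps are left out here to keep the sketch short and runnable offline.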
Moreover, co-operation with the Mathdoc team has led to progress in solving the problem of the structured data entry of tables of contents of periodicals, with a view to enabling direct access to an article from an author or title search. The possibility of making the information from the tables of contents of periodicals available via OAI is therefore also being studied. An enormous digitized library is emerging, incorporating all the large digitized corpora and providing the navigation tools required to make it substantially easier for readers to consult documents that are difficult to access, such as periodicals, dictionaries and complete works.