RETHINKING SUBJECT CATALOGING
IN THE ONLINE ENVIRONMENT
Marcia J. Bates
Graduate School of Library and Information Science
University of California, Los Angeles
Copyright © 1988
by Marcia J. Bates
The new search capabilities in online catalogs have numerous implications for 1) the use of subject cataloging in existing records, 2) the design of thesauri, and 3) the design of the online catalog user-system interface. All three areas are discussed. Online search capabilities are themselves seen as a form of indexing, and it is argued that access is determined by the total mix of pre-existing and added "search capability" indexing. The impact of title keyword searching on retrieval is discussed in some detail. The design of a "Superthesaurus" as a part of a friendly front-end user interface is described. Said thesaurus is geared to the needs of users, rather than indexers, and incorporates the findings of recent research on the patterns of description of subject by searchers. Its design also reflects the different demands of online searching as opposed to manual searching.
We in the library field recognize that the introduction of online catalogs into libraries opens up impressive new possibilities of power of retrieval and ease of use for ourselves and our clients.1,2 Our task now is to design the intellectual content and arrangement of catalogs so as to take maximum advantage of these new technical capabilities.
In order to do that, we need to understand the interaction between the database--the bibliographic records--and the search capabilities of the online system. There are many ramifications, some obvious and others quite subtle, of this interaction. It is the purpose of this article to develop these ramifications in three areas relating to subject cataloging:
• Role of present subject cataloging in existing records
• Subject headings/thesaurus design
• System-user interface design
Throughout the discussion, emphasis will be on improving user access and retrieval effectiveness. Below, each of these three areas is discussed in turn.
ROLE OF PRESENT SUBJECT CATALOGING IN RECORDS
A note on terminology: Throughout this article, the terms "indexing" and "subject cataloging" are used interchangeably. Similarly, "thesaurus" is used to refer both to Library of Congress Subject Headings (ordinarily called a subject heading list) and to other term lists more conventionally referred to as thesauri. The distinctions between subject cataloging and indexing as processes, and between subject heading lists and thesauri, are important more for historical reasons than for present practice. Though differences may still be discerned, so many changes in thinking about these are necessitated by the new circumstances associated with the online environment that traditional distinctions are largely meaningless anyway.
Based on our experience with card catalogs, we have been conditioned to think of the subject indexing in a catalog as consisting of subject headings and classification numbers. When a card catalog is put online, it is natural, then, to continue this assumption--to think of the indexing in an online catalog as consisting of the same subject elements that were already present in the database before it was automated. We think of an online catalog as simply the same catalog we had before, but now online-accessible.
But, in fact, online search capabilities themselves constitute a form of indexing.
Subject access to online catalogs is thus a combination of original indexing and what we might call "search capabilities indexing" (more on this presently). The interactions between these two kinds of indexing can be subtle. A range of search capabilities are superimposed on textual materials of various kinds in the bibliographic record, including natural and controlled languages. The resulting interaction has a variety of effects on the quality of searches done by users in online catalogs. The effects are not only additive. Sometimes they cancel each other out or are so synergistic as to be multiplicative. We will consider some of these effects in this section, and draw implications for how to handle searching of documents already present in the catalog database.
Typical online search capabilities are keyword searching, Boolean searching, truncation, and multi-index searching (that is, being able to combine query terms from more than one index, e.g., "FIND TITLE Grapes AND FIND AUTHOR Steinbeck").
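The capabilities just listed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names, and the `find` function are hypothetical and do not correspond to any particular catalog system. Boolean AND across two indexes is modeled as set intersection, OR as set union, and right truncation as a prefix match.

```python
# Hypothetical records; field names and contents are assumptions for illustration.
RECORDS = {
    1: {"title": "The Grapes of Wrath", "author": "Steinbeck, John"},
    2: {"title": "Growing Grapes in California", "author": "Doe, Jane"},
    3: {"title": "East of Eden", "author": "Steinbeck, John"},
}

def find(field, term, truncate=False):
    """Return ids of records with a word in `field` equal to `term`,
    or beginning with `term` when truncate=True (right truncation)."""
    term = term.lower()
    hits = set()
    for record_id, record in RECORDS.items():
        words = [w.strip(".,") for w in record[field].lower().split()]
        if any((w.startswith(term) if truncate else w == term) for w in words):
            hits.add(record_id)
    return hits

# "FIND TITLE Grapes AND FIND AUTHOR Steinbeck": AND is set intersection.
both = find("title", "grapes") & find("author", "steinbeck")
print(sorted(both))  # [1]

# OR is set union; truncating "grape" also matches "grapes".
either = find("title", "grape", truncate=True) | find("title", "eden")
print(sorted(either))  # [1, 2, 3]
```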
To illustrate my point that search capabilities actually produce a kind of indexing, let us take keyword searching as an example. Let us imagine two online catalogs, one with the ability to search for exact matches on entire subject headings and titles, and the other with that same capability plus keyword searching of titles. (We will ignore author access in this discussion.)
In the case of the subject heading plus keyword title catalog, every word (except stop words) in the title of each record in the catalog has now become an instant index term. The equivalent in a manual catalog would be to provide an index that listed every non-stop-word in the title of each book as an entry. (This approach is in fact quite close to a "Key Word in Context" index.) But manual catalogs conventionally do not provide such title keyword indexes. So this online capability adds a major new form of indexing to the subject access.
Such title keyword terms have the important limitation of being uncontrolled vocabulary, but they have the advantage of providing subject access of another type not previously available. So when an online catalog simply possesses the capability of being searched by title keyword, we, in effect, have a whole new index to the catalog, with every title word an index term, even though we nowhere see that whole index printed out.
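The "instant index" created by title keyword searching can be made concrete with a minimal sketch. The stop list, the sample titles, and the function name below are assumptions chosen for illustration, not features of any actual catalog. Every non-stop word in a title becomes an entry pointing to the records that contain it.

```python
from collections import defaultdict

# Hypothetical stop list and records, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def build_title_keyword_index(titles):
    """Map each non-stop word to the set of record ids whose title contains it."""
    index = defaultdict(set)
    for record_id, title in titles.items():
        for word in title.lower().split():
            word = word.strip(".,:;!?\"'()")
            if word and word not in STOP_WORDS:
                index[word].add(record_id)
    return index

titles = {
    1: "The Grapes of Wrath",
    2: "Growing Grapes in California",
    3: "A History of California",
}
index = build_title_keyword_index(titles)

print(sorted(index["grapes"]))      # [1, 2]
print(sorted(index["california"]))  # [2, 3]
print("the" in index)               # False
```

The index is never displayed to the searcher, yet every lookup behaves as though it had been compiled in advance.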
If the addition of uncontrolled title keyword terms seems like a trivial or dubious improvement in subject access, consider the results of a study by Ann Schabas. She did an extensive experimental study on retrieval of UK MARC records in a selective dissemination of information system in Canada.3 Users of the system were asked to give relevance judgments on over 5000 retrieved book records. The records contained both LCSH and PRECIS indexing, and searching could be done on title words as well. Consequently, she was able to make the following retrieval performance comparisons:
• LCSH vs. PRECIS
• LCSH vs. LCSH plus title words
• PRECIS vs. PRECIS plus title words
• LCSH plus title words vs. PRECIS plus title words
Searching was done postcoordinately, that is, with Boolean logic on individual words and phrases, not just on whole subject headings or titles. She found only modest differences in performance between LCSH and PRECIS, but substantial improvement in both systems when title terms were included. For example, using recall (percentage of relevant documents retrieved) and precision (percentage of retrieved documents that are relevant) measures, she found in the LCSH/PRECIS comparison that PRECIS had 5.7 percent better recall than LCSH, and LCSH had 1.7 percent better precision than PRECIS--very similar performances. On the other hand, in the LCSH vs. LCSH plus title comparison, addition of title words to LCSH improved recall by 14.7 percent, and with PRECIS vs. PRECIS plus title, by 11.1 percent.4 It is often the case that techniques used to improve recall are found to harm precision, and vice versa; that is, there seems to be a stubborn trade-off between the two measures, so that it is hard to improve overall system performance. In this case, there were no such problems. These substantial improvements in recall with the addition of title terms brought only slight declines in precision--2.9 and 2.2 percent, respectively.5
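For readers who want the two measures spelled out, the following sketch computes recall and precision from sets of record numbers. The sets are invented for illustration and are not Schabas' data. They are arranged only to show the pattern described above: adding title terms raises recall considerably while lowering precision slightly.

```python
def recall_precision(retrieved, relevant):
    """Recall: share of relevant records that were retrieved.
    Precision: share of retrieved records that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

# Hypothetical record numbers, not drawn from any study.
relevant = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
headings_only = {1, 2, 3, 4, 11}
headings_plus_titles = {1, 2, 3, 4, 5, 6, 11, 12}

print(recall_precision(headings_only, relevant))         # (0.4, 0.8)
print(recall_precision(headings_plus_titles, relevant))  # (0.6, 0.75)
```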
Schabas' results are not completely comparable to title keyword searching, because the system she used permitted "string searching," that is, searching on strings of characters, including blanks--thus permitting search on phrases as well as single words. In most online catalogs with keyword searching, a searcher can ordinarily express a phrase only by use of a Boolean AND between the words of the phrase, either implicitly or explicitly. Thus false coordinations can occur when the ANDed terms appear in a different order than intended, or separated by other words. However, given that titles are short pieces of text, serious problems with false drops are unlikely to occur often. Schabas' results suggest that some real improvement in retrieval performance may be expected through use of title terms.
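The difference between string searching and ANDed keywords can be shown with a short sketch. The function names and sample titles are hypothetical. The second title is a false coordination for the phrase "library science": both words are present, but not adjacent and not in the intended order.

```python
def words(text):
    """Lowercase the text and split it into words, dropping edge punctuation."""
    return [w.strip(".,:;") for w in text.lower().split()]

def and_match(title, phrase):
    """Keyword AND: every query word occurs somewhere in the title."""
    title_words = set(words(title))
    return all(w in title_words for w in words(phrase))

def phrase_match(title, phrase):
    """String-style search: the query words occur adjacently and in order."""
    t, p = words(title), words(phrase)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

# Hypothetical titles for illustration.
for title in ["Library Science Today", "Science in the Public Library"]:
    print(title, and_match(title, "library science"),
          phrase_match(title, "library science"))
# Library Science Today True True
# Science in the Public Library True False
```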
It can be seen from these results that the presence of a title keyword search capability produces a retrieval system that is significantly different in character from one without such a capability. We can search in different ways--specifically, on all title words, not just on first word in title, and we can use that title search feature in conjunction with other capabilities--and we get very different performance results. Thus, the availability of additional search capabilities does not just give us more bells and whistles to play with; rather each search capability creates a significantly different retrieval system with its own performance characteristics.
The other major result of Schabas' study noted above, namely, the similarity in performance of LCSH and PRECIS, sheds light on another point made earlier--that the combination of pre-existing indexing and search capability indexing is not always simply additive. One of the major characteristics of PRECIS that recommends it as a possible improvement over LCSH is its well-worked out set of techniques to arrange the subject elements of the PRECIS heading in different orders. These techniques bring each element to the head of successive entries while preserving the semantic context of the whole heading. We might, therefore, expect PRECIS to perform very significantly better than LCSH. When each significant word in the title is accessed through keyword matching anyway, however, this great strength of PRECIS is, in effect, ignored, or not needed, by the system, and the corresponding weakness of LCSH in manual systems (entry only through first word in title) is overcome.
There is another very important implication of these results. PRECIS, with its rigorous analytical techniques, has been widely discussed as a prime candidate for replacing LCSH and therefore improving subject access in academic libraries.6,7 Yet with the advent of these online search capabilities, and the particular mix of effects these features have on the use of PRECIS and LCSH, it now appears that LCSH performs essentially as well as PRECIS. We are saved from the enormous cost of converting to PRECIS for reasons of performance, not just financial limitations.
We can see from this example that pre-existing indexing and search capability indexing can interact in ways that produce surprising results. To understand actual performance characteristics of online catalogs, one must analyze each search feature that is projected or in use, and determine its interaction with existing indexing.
The evaluation of search capability indexing can become quite complex. We have looked at only one form, title keyword searching. Not only are there the other major classes of search features described above (plus others not mentioned), but there are also numerous small variations on each type. These variations may alter the character of a given type of search capability considerably. For example, in systems that require exact matches on subject headings, the punctuation may or may not be regarded, main headings plus subdivisions may be treated as one heading or as separate headings, and truncation on the right end of the heading may or may not be allowed, among others. Finally, the total combined set of search features, each with small variations, makes a unique mix for each different online catalog system. In sum, with online catalogs, the fact that several different systems may be using the same bibliographic data with associated subject headings says little about the actual subject access available to searchers in those systems.
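How much these small variations matter can be seen in a sketch of an exact-match routine with three switches, one for each variation named above. The function names, the switches, and the sample heading are assumptions for illustration. The same query against the same heading succeeds or fails depending only on the system's settings.

```python
import re

def normalize(text, ignore_punctuation=True):
    """Lowercase, optionally turn punctuation into spaces, collapse whitespace."""
    text = text.lower().strip()
    if ignore_punctuation:
        text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def matches(query, heading, ignore_punctuation=True,
            split_subdivisions=False, right_truncation=False):
    """Exact match of query against heading under the chosen system settings."""
    # Optionally treat the main heading and each subdivision as separate headings.
    units = heading.split("--") if split_subdivisions else [heading]
    q = normalize(query, ignore_punctuation)
    for unit in units:
        u = normalize(unit, ignore_punctuation)
        if u == q or (right_truncation and u.startswith(q)):
            return True
    return False

heading = "Libraries--Automation"  # sample heading for illustration
print(matches("libraries automation", heading))                            # True
print(matches("libraries automation", heading, ignore_punctuation=False))  # False
print(matches("automation", heading))                                      # False
print(matches("automation", heading, split_subdivisions=True))             # True
print(matches("librar", heading, right_truncation=True))                   # True
```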
Title keyword access illustrates another point about how our thinking must shift in the new online environment. It is natural to think of titles as part of descriptive rather than subject cataloging, so when we consider use of title terms in online catalogs, it is easy to think of title keyword searches as "known-item" searches, as opposed to subject searches. In a card catalog, if searchers wanted something on a subject and looked in the author/title catalog, they were usually making a basic mistake.
But, as we have seen above, in online catalogs, title keyword searching can constitute a powerful kind of subject searching. Keyword matching with one or two title words--either words from a known title, or just fishing--can often produce a number of highly relevant titles. Many users are content with just a few items, so the comprehensiveness of a search on the controlled vocabulary of a subject heading is of little interest to them. Thus, such title keyword searching is not only an effective form of subject searching, but with some kinds of user needs, is arguably even preferable. So it no longer seems appropriate to think of title searches as necessarily being for known items.
This difference in thinking about title searching that is necessitated by the shift from card to online catalogs may play a role in the ongoing debate about the relative importance of subject and known-item searching in online catalogs. Early results of online catalog studies showed a substantial rise in use of the subject approach over that found in previous card catalog studies. Reviewing 41 card catalog use studies, Markey found that an average of 40 percent of the usage was for subject searches.8 On the other hand, the Council on Library Resources study, published in 1983, found that subject uses constituted fully 59 percent of all online catalog uses across many types of libraries and online systems, hence a substantial increase in subject searching for online catalogs.9 More recent data gathered by Ray Larson in a large study of the University of California MELVYL system, however, is showing a considerable reduction from the initial high level--down to the 20-30 percent range--in the amount of subject index use. The difference is being made up by extra title searching.10
These results have received some discussion in the field to the effect that the high use of subject was just a temporary phenomenon reflecting interest in a new toy, and that known-item searches are now returning to their original pre-eminence.
Larson and I both support a different explanation for these results, however. We would suggest that the experience of problems with subject searching on the one hand, and easy retrieval of at least some items with title terms on the other hand, is leading many end users to shift at least part of their subject searching from subject headings to titles--hence the falloff in subject index use.
There have always been difficulties in matching search terms with subject headings.11,12 In the online environment some serious problems with subject searching remain. The same low number of headings exist in online document records as in manual, and cross references--available in card catalogs--are often not mounted in online catalogs. If this explanation is correct, then the falloff in subject index use is due to problems with subject indexes, not to lack of interest in subject searching.
SUBJECT HEADINGS/THESAURUS DESIGN
We have seen in the previous section that evaluating the subject access available in an online catalog requires analysis of subject cataloging in combination with online search capabilities. There are many specific ramifications of such an analysis for authority control, for making specific decisions about particular subject headings and cross references.
Here, however, I wish to address some broader issues, ones having to do with the overall design and character of subject headings and thesauri in the online environment. The specific decisions of authority control need to be made within the context of a different conception of subject heading access now that we have online capabilities. We now have the opportunity to greatly improve the power and ease of catalog use for searchers, but to do that we must make more effective use of the capabilities provided by the online technology.
Catalog users could always have benefited from more assistance than we provided them in determining which subject heading to use, but subject indexing has been constrained in the card catalog by space and staff limitations. Now there are not only more possibilities in providing access in the online environment, but there are reasons why certain kinds of assistance are more essential with online catalogs.
The first reason is that for the searcher the online catalog is a kind of "black box." That is, one cannot look inside it and see what is there, the way one can look into a card catalog drawer. One has to tell the system something in order to get anything out. In many online catalogs, the requirements for subject searching are that the searcher must state an exactly correct Library of Congress subject heading in order to retrieve anything. One cannot get part of the heading right and then fish around in the same area of the drawer, as with card catalogs. Yet it is in online catalogs that see and see also references have typically been added later or not at all. So in the online environment the user particularly needs help in identifying good headings.
Secondly, the presence of implicit and explicit Boolean logic in online catalogs allows the searcher to do more combining of subject elements than was the case formerly (see also Bates13). Traditionally, catalogers have assembled main headings and subdivisions to create the final subject description. However, now, online catalogs that allow searchers to treat main headings and subdivisions separately, and which allow searching on keywords, enable searchers to assemble elements of subject search formulations themselves to meet search needs. In this environment, searchers need help both with strategy and with identifying elements to combine.
Third, in addition to having more combinatorial choices of subject elements, the searcher also has more options with search formulation and modification: choice of different indexes and combinations of indexes to search, truncation, limitation by file (e.g., limiting to serials), etc. Frequently, the searcher needs to use these techniques to increase or decrease the size of output sets. In sum, with more powerful searching possibilities, the searcher needs more powerful assistance.
There are several points that I wish to make in this and the next section about the kinds of assistance we can and should give to users of online catalogs. For this section, on thesaurus design, the following summary statement can be made: The searcher should be provided a user thesaurus (in contradistinction to an indexer thesaurus), incorporating vocabulary for online search features such as keyword searching and Boolean logic. Let us take each part of this sentence and discuss it in turn below. Throughout, the assumption is that online access to the thesaurus will be provided, though many of these features could be incorporated into a manual thesaurus as well. In the next section we will discuss ways to integrate these and other features into the user-system interface.
• User thesaurus. Most current thesauri, including LCSH, are designed primarily for the indexer/cataloger. These we may call indexer thesauri. There have long been debates as to whether the LCSH list should be made available to users. There are good reasons for the debates, because on the one hand the searcher should have access to a compact (relatively!) list of headings instead of having to look many places in the catalog, while on the other hand, the LCSH list could easily confuse the naive user.
Many headings in a library's catalog are nowhere to be seen in the LCSH list, because they are pattern headings, or use floating subdivisions. On the other hand, many headings appear in the list that are not used in a particular library. Until recently, confusing codes, such as "x" and "xx," appeared, and scope notes were generally written to clarify the sorts of confusions that catalogers would have, not the ones felt by end users. Even see references, though intended for users also, are written in the grammar of subject headings; in other words, they have similar form and parts of speech to those of headings, so the user must have at least a minimum level of familiarity with the patterns of subject headings in order even to come up with the "wrong" (see from) terms!
My own research has confirmed the latter situation. Students, without access to the LCSH list--a condition common in many libraries--were asked to write down the headings they would search under to find books on the same topic as books each described by a title and abstract. Library students who had studied cataloging and non-library university students both performed the task. Both groups were nonspecialists in the fields they were asked to describe. The library students did almost twice as well as the non-library students in coming up with the correct headings for the books.14 In other words, just being familiar with the form and syntax of Library of Congress subject headings in general enabled the library students to succeed much more often in producing part or all of the correct heading or a see reference for the books.
A user thesaurus, on the other hand, would be designed primarily for the user, and address the questions and confusions felt by the user in searching. Headings actually used in that library would somehow be indicated and/or explained, many scope notes and definitions would be included, and cross reference terminology would be self-explanatory. Finally, there would be far more entry (see from) terms than presently. Much greater variety in see from terms would be present to accommodate the colloquial and other popular labels for topics. (See also Bates.15)
• Vocabulary for online search features. As noted earlier, in an online catalog the searcher generally has additional search capability indexing available, and will choose among the various options--pre-existing indexing and instant indexing--to fulfill a particular search need. Therefore, it seems appropriate that the user thesaurus should reflect this fact, and provide assistance with the other (instant) indexing terms as well.
It may sound contradictory to have a thesaurus that includes terms that are not controlled and are not "see from" terms either. Here, we can take a leaf from the experiences of online database searchers, who have been dealing for much longer with the same problem we are discussing, namely, the availability of both controlled vocabularies and many powerful "free text" capabilities (ability to search on words or phrases anywhere in certain fields, regardless of whether the terms are controlled or not--similar to keyword searching in online catalogs).
Because of the availability of these search features, online database searchers have gotten in the habit of routinely considering terms in both controlled and uncontrolled vocabularies for their search formulations. Controlled vocabulary from one database can also be used in a free text mode in another one. The indexer or cataloger is naturally concerned with identifying the "correct," controlled, vocabulary; the online searcher, on the other hand, finds that the line between legitimate and non-legitimate indexing terms has become blurred.
Usually, free text searching includes the possibility of doing what is, in effect, keyword searching on the controlled vocabulary as well as the natural language text in abstracts and other fields. Consider a single-word keyword match on a multi-word subject heading. What kind of indexing is that--controlled or uncontrolled?! It can be seen that the distinctions between controlled and uncontrolled or between pre-existing indexing and search capability indexing blur, the more power one gains in online searching capabilities.
Many thesauri do not yet recognize these realities of online searching, and stick mainly to formal, legitimate terminology, but some thesauri are beginning to be designed specifically for online searching (cf. Piternick16). While we are talking about producing a true user thesaurus for online catalogs, let us design it for these actual online searching conditions.
The thesaurus that, to my knowledge, most fully reflects these new realities exists as an online database (a database totally devoted to the thesaurus, no indexed documents are present), and is called TERM in the BRS search system. TERM contains merged thesauri from several fields (education, psychology, and medicine, among others), as well as natural language terms suggested by practicing searchers and found in database records and in reference books. Even suggested Boolean combinations are included.
When a searcher inputs a term or phrase into the BRS TERM database, an entry is printed out, listing possible alternative terms, that may be up to several dozen lines long. Since searchers do so much free text searching, the presence of vocabulary, controlled or uncontrolled, from other disciplines, only enriches the possible choices of terms for any one search. Furthermore, Boolean combinations are suggested by prompting the searcher to AND terms from each of two columns containing variant terms for component concepts in the topic phrase of the entry. For example, in the entry for ethnopsychology, the searcher is prompted to AND terms from one column dealing with ethnicity and another column of terms dealing with racial identification and self concept. The database sources of controlled vocabulary terms are indicated so that when the searcher does wish to use controlled terms in just the one source database, he or she may do so. An online catalog thesaurus that contained these features would greatly improve the resources available to the catalog user.
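A thesaurus entry of this kind might be represented as follows. This is a sketch only: the data structure, the function names, and the terms listed in each column are hypothetical and are not taken from the actual TERM database. ORing the variants within a column is likewise an assumption, a common searcher practice rather than something the description above specifies.

```python
from itertools import product

# Hypothetical entry modeled loosely on the description above; the terms
# are illustrative and are not copied from the TERM database.
entry = {
    "topic": "ethnopsychology",
    "column_a": ["ethnicity", "ethnic groups"],
    "column_b": ["racial identification", "self concept"],
}

def pairwise_suggestions(entry):
    """AND each term in the first column with each term in the second."""
    return [f"{a} AND {b}"
            for a, b in product(entry["column_a"], entry["column_b"])]

def combined_query(entry):
    """Assumed extension: OR the variants within a column, then AND the columns."""
    a = " OR ".join(entry["column_a"])
    b = " OR ".join(entry["column_b"])
    return f"({a}) AND ({b})"

print(pairwise_suggestions(entry)[0])  # ethnicity AND racial identification
print(len(pairwise_suggestions(entry)))  # 4
print(combined_query(entry))
# (ethnicity OR ethnic groups) AND (racial identification OR self concept)
```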
There are implications of the above discussion for the issue that has recently arisen in our field, namely, how to handle multiple thesauri in online catalogs.17 Whatever we do for the indexer/cataloger in the way of displaying various thesauri, we may want to make the public user thesaurus display different, and better adapted to actual end user needs. Including free text terms and possible Boolean combinations are examples of assistance that would be of little use to catalogers assigning controlled vocabulary, but of great use to end users.
USER-SYSTEM INTERFACE DESIGN FOR ONLINE CATALOGS
In the previous sections we considered new ways of thinking about the process of online searching and the means of assisting online catalog users. There remains one more piece of the puzzle to be discussed. More and more evidence is accumulating that indicates that we must change some fundamental assumptions about the nature of subject access and about the way people mentally process "subject" as they search for a topic. Some of this evidence has been around for many years; some has appeared as recently as 1988 as I write this. Now at last, when online catalogs are coming rapidly into use, we have the power to address the implications of this information and alter the design of subject access accordingly. It is particularly important to address this matter in the current relatively early, fluid, stage of online catalog design. Later, as design becomes standardized, and commitments are made to certain approaches, it will be much harder to use what we know to design systems that truly accommodate users.
A fundamental assumption of subject cataloging--so basic that it is seldom discussed--is that, as a rule, subject concepts can each be well described with one heading and, as needed, one to a handful of see references for synonyms of the heading. That is, by our providing this range of access terms for subject concepts (which terms are, in turn, applied to documents written on the concepts), the great majority of users will be able to match up their search term with a heading or a see reference describing items of interest to them in the catalog.
Now, however, it appears that this model of a catalog user's thinking and search process may be highly inaccurate. Studies in office automation, psychology, and in several subfields of library/information science all suggest that the human mind produces very much greater variety of description of concepts than we have assumed. Furthermore, and of even greater significance, even the most frequent of the terms are used by only a small minority of people.
I have reviewed the research at length in another article.18 Here, I will mention some examples, and add the results from the most recent research, which has appeared since that article. Furnas, et al. were interested in identifying the best names to use for text-editing operations so that these names could be used in the design of automated text-editing systems. They did several studies, which produced similar results. For example, in one, 48 secretarial and high-school students were given a sample manuscript with authors' corrections and asked "to prepare a typed list of instructions for someone else who was actually going to make the changes but did not have the author's marks".19 Here, one might expect the range of terms to be smaller than is the case in our field, because very specific concrete operations were being described, rather than the topic of an information need. Yet the authors note: "The most striking result from the verbal production data was the great diversity in people's descriptions.... The average likelihood of any two people using the same main content word in their descriptions of the same object ranged from about .07 to .18."20
Lilley and I, in separate card catalog studies, found low frequencies for search terms. Lilley asked 340 students to give subject headings that they might search on to find six books. An average of 62 different headings were suggested for each book.21 In my study, students were asked to state the search term they would use to find a book just like the one described in an abstract.22,23 The study was not designed to examine intersearcher consistency, but when I recently scanned the responses given by the undergraduate and graduate students in the study I found the same enormous variety found by Lilley. For example, 71 students responded to the first book in the study; they produced 46 different headings (some varying by singular/plural only), no one of which was suggested by more than six people.
The most recent research is that of Tefko Saracevic, who, in a large, federally funded study, examined various features of online searching performance.24-26 In one of the substudies, he computed the degree of agreement among professional searchers in search terms used for the same test questions. He compared five searches on each of 40 test questions, a pair at a time. (Incidentally, the 40 questions were all real information needs, not manufactured queries.27) So, in total, there were 800 pairwise comparisons.28 Looking at the degree of agreement between searchers in terminology used, he found the following results: In 56 percent of the comparisons, the overlap in terms used was 25 percent or less, and in fully 94 percent of the comparisons the overlap in terms was 60 percent or less. In only 1.5 percent of the cases were the search formulations identical.29
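A pairwise comparison of this kind can be sketched as below. Two assumptions should be stated plainly. First, the overlap measure used here (the share of one searcher's terms that the other searcher also used) is my own illustrative choice, not necessarily the measure used in the study. Second, the total of 800 is consistent with comparing each search against each other search in both directions (5 x 4 = 20 ordered pairs per question, times 40 questions), and the sketch adopts that reading. The term sets are invented.

```python
from itertools import permutations

def overlap(terms_a, terms_b):
    """Share of searcher A's terms that searcher B also used (asymmetric)."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a) if a else 0.0

# Hypothetical term sets from five searchers working on one question.
searches = {
    "s1": {"catalogs", "subject headings", "searching", "users"},
    "s2": {"catalogs", "retrieval"},
    "s3": {"opac", "subject access"},
    "s4": {"catalogs", "subject headings"},
    "s5": {"library catalogs", "indexing", "users"},
}

pairs = list(permutations(searches, 2))
print(len(pairs))       # 20 ordered pairs for one question
print(len(pairs) * 40)  # 800 across 40 questions

# The measure is asymmetric: the two directions can differ.
print(overlap(searches["s1"], searches["s2"]))  # 0.25
print(overlap(searches["s2"], searches["s1"]))  # 0.5

# Share of comparisons with overlap of 25 percent or less.
low = sum(1 for a, b in pairs if overlap(searches[a], searches[b]) <= 0.25)
print(low / len(pairs))
```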
We may summarize the results from all the above studies, as well as the others discussed in Bates,31 by saying the following: If, to a large group of subjects (i.e., people in a research study), you give a task that requires them to generate terms for a concept, even the most frequently used term will, on average, be mentioned by only about 10 to 20 percent of the subjects. Each of the other terms will be used by even fewer people.
So instead of people's descriptions clustering around just one or two terms, there may be dozens used, and even the most popular terms are used by a relatively small minority. Thus, a single assigned heading plus a couple of see references for synonymous terms will probably capture, that is, match with, only a minority of the entry terms used by catalog searchers for a given topic.
This great variety of search terms, matched against the one or two headings that are assigned per book, produces some other results that are not surprising, given the above. Karen Markey reviewed several studies of search success on a variety of online catalogs. She found that 35 to 50 percent of keyword searches of subject heading fields resulted in no retrievals at all.32 In her own research, she found, furthermore, that in many of the cases where there was a match with a term, no relevant materials were found, i.e., it is likely that some other access term was needed to locate the desired material.33
This situation is not acceptable. Searches that yield no matches, or matches only with irrelevant headings, defeat the whole purpose of public access catalogs. The fault cannot be attributed to inadequacies of online system features, either, because my aforementioned research with card catalogs showed the same low-match and irrelevant-match pattern. It is much more likely that the problem lies with this mismatch between the way our minds work and the way our catalogs have traditionally been designed conceptually.
How shall we handle this situation? Supplementing the indexing by adding two or three additional headings would help but would not do the job. Something in the neighborhood of 10 to 30 or more additional headings would be needed to really cover the range of search terms likely to be used by catalog users--a totally impractical solution.
Do these results then make authority control pointless? Not at all. The benefits of consistency, accuracy, and control in subject description are not lost because we find that people use a wide variety of terminology in searching. But we must distinguish document description from access. We may describe documents compactly, with just a few headings, but we should then provide some way to channel searchers from the initial wide variety of expression to the much more limited number of terms actually used to index relevant documents.
So now let us return to the idea of a user thesaurus. If said thesaurus were expanded and enriched as a front-end database, a Superthesaurus, it could contain an enormous variety of entry terms, with all sorts of forms of guidance for the searcher to enable him or her to decide on the best terms for a given search. Hierarchical relationships could be displayed, including multiple hierarchies when terms reside in several hierarchies. Where colloquial entry terms are ambiguous, the several relevant controlled terms could be displayed, enabling the searcher to realize that more than one interpretation can be given to a term.
Some of the variety of entry terms is due to word form variations on the same root, word order variations, and use of various combinations of subsets of component terms in the topic description. The word form variation can be reduced through a stemming algorithm, and word order problems can be handled by implicit Boolean AND on component words of the search term. Searchers also frequently use too many terms in their search formulations, or enter only one or two broad terms, and consequently get null sets or huge sets. For searchers with these problems, either automatic modification of search formulation, or facilitative suggestions on-screen could be provided.
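As a rough illustration of the two techniques just mentioned, the following sketch reduces word forms with a deliberately crude suffix stripper and treats the words of a search term as an implicit Boolean AND, so that word order does not matter. The suffix rules, function names, and sample headings are assumptions made for the example. The stemmer is a toy stand-in and will mishandle many English words.

```python
import re

# Toy suffix rules, tried in order: (suffix to strip, replacement)
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("es", ""), ("s", "")]

def crude_stem(word):
    """Reduce a word to a rough root form. Not a real stemming algorithm."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "access" alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word

def stems(phrase):
    """Set of stems for the words in a phrase, ignoring punctuation and order."""
    return {crude_stem(w) for w in re.findall(r"[a-z]+", phrase.lower())}

def search(query, headings):
    """Implicit Boolean AND: keep headings containing every stem of the query."""
    wanted = stems(query)
    return [h for h in headings if wanted <= stems(h)]

# Hypothetical headings for the example
headings = ["Online catalogs", "Subject cataloging", "Catalogs, Subject"]

hits_a = search("subject catalog", headings)
hits_b = search("cataloging subjects", headings)
```

Both queries reduce to the same pair of stems, so both retrieve "Subject cataloging" and "Catalogs, Subject" regardless of word form or word order.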
Most importantly, the searcher would never have that dreadful feeling of entering a perfectly reasonable word or phrase for a topic and finding that the library apparently had nothing on it. Many searchers give up at this point, not appreciating the complexity of subject description. Here we would be implementing what I have elsewhere called the "Side of the Barn Principle," namely, the searcher need only "hit the side of the barn" with an initial entry term. Any reasonable English language word or phrase should get some response, some recommended alternative term(s) or a display of related terms to guide the searcher to the best headings actually used to index the desired topic.34
For those users who are not interested in sorting through vocabulary to find the best, and just want the system to do it all, the presence internally of the vast superthesaurus network would make it possible for system algorithms to be developed that would automatically link search terms with likely correct headings.35 This question of degree of transparency of system will probably ultimately be resolved by providing two or more options, to accommodate both the active explorers and the passive "do-it-for-me" searchers.
We may make an analogy here to automobile design. The automatic shift and the stick shift are both available in the automobile market because they satisfy different kinds of needs and personalities. The same can be said for automatic and manually set cameras. Some people are confused by manual controls and only want "point and shoot" cameras, while others insist on maintaining control over the settings themselves and would not dream of allowing the camera to do it for them. No matter how sophisticated "automatic" information searching interfaces become, there will probably also remain a desire for control over the search by some (or many) searchers some (or all) of the time.
The exact system design of such a thesaurus and associated interface would depend on many practical factors, above all cost. However, some desirable features would be the following:
• The Superthesaurus should be independent of, yet linked to, the document indexing. A separate up-front database contains the thesaurus, with an associated user-friendly interface. A searcher may move around indefinitely in the thesaurus, following up linkages, i.e., the searcher can "helicopter" over the domain of terminology. When the user identifies a promising heading (one that is marked as being one that actually indexes documents), he/she may ask to see documents indexed under that heading without having to enter specific commands to withdraw from one database and enter another one.
• The Superthesaurus should contain a very large entry vocabulary and numerous different kinds of linkages and user aids: display of multiple hierarchies when a term falls within different contexts, display of related terms, definitions and scope notes, etc. These and other possibilities are discussed in more detail in Bates.36
• The system may be transparent or not, with the user either actively exploring the terminology or simply letting the system use the linkages to come up with likely document matches.
• Document indexing and re-indexing do not have to take place for every addition to the Superthesaurus; the links always lead to existing indexing. This situation makes for much more flexibility and rapidity in adding new terminology and changing links. Briefly popular colloquial terms can be added and later removed from the Superthesaurus if they fade from usage, without requiring re-indexing of the documents themselves.
• Once a basic skeleton thesaurus is in place, the Superthesaurus could be developed incrementally, with additional types of linkages added through time by various interested groups and organizations.
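The separation described in the features above, between a large entry vocabulary and the much smaller set of headings that actually index documents, can be sketched as a pair of independent data structures. This is a minimal sketch under stated assumptions, not a system design: the class name, method names, sample terms, headings, and document identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Superthesaurus:
    # Entry term (any searcher wording) -> controlled headings it may lead to
    entry_links: Dict[str, Set[str]] = field(default_factory=dict)
    # Controlled heading -> broader headings (possibly in several hierarchies)
    broader: Dict[str, Set[str]] = field(default_factory=dict)

    def add_entry_term(self, term: str, headings: List[str]) -> None:
        """Link a searcher's term to one or more controlled headings."""
        self.entry_links.setdefault(term.lower(), set()).update(headings)

    def remove_entry_term(self, term: str) -> None:
        """Drop a term that has faded from usage."""
        self.entry_links.pop(term.lower(), None)

    def suggest(self, term: str) -> List[str]:
        """Controlled headings to display for an entry term (may be several)."""
        return sorted(self.entry_links.get(term.lower(), set()))

# Document indexing is kept separately and uses controlled headings only
document_index: Dict[str, List[str]] = {
    "Automobiles": ["doc-1", "doc-2"],
    "Wheels": ["doc-3"],
}

def documents_for(heading: str) -> List[str]:
    """Documents indexed under a controlled heading."""
    return document_index.get(heading, [])

thesaurus = Superthesaurus()
index_before = {h: list(d) for h, d in document_index.items()}

thesaurus.add_entry_term("cars", ["Automobiles"])
thesaurus.add_entry_term("wheels", ["Automobiles", "Wheels"])  # ambiguous slang
thesaurus.broader["Automobiles"] = {"Motor vehicles", "Transportation"}

car_headings = thesaurus.suggest("Cars")
wheel_headings = thesaurus.suggest("wheels")
car_documents = [d for h in car_headings for d in documents_for(h)]

thesaurus.remove_entry_term("wheels")  # term fades; no re-indexing needed
index_unchanged = document_index == index_before
```

In this arrangement, adding or removing an entry term changes only the thesaurus links. The document index is never touched, which is the property that allows the vocabulary to grow incrementally.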
1. Charles R. Hildreth, "Pursuing the Ideal: Generations of Online Catalogs," in Online Catalogs, Online Reference: Converging Trends; Proceedings of a Library and Information Technology Association Preconference Institute on June 23-24, 1983, eds. Brian Aveney and Brett Butler (Chicago: American Library Association, 1984), p. 31-56.
2. Gary S. Lawrence. "System Features for Subject Access in the Online Catalog." Library Resources and Technical Services 29:16-33 (January/March 1985).
3. Ann H. Schabas. "Postcoordinate Retrieval: A Comparison of Two Indexing Languages." Journal of the American Society for Information Science 33:32-37 (January 1982).
4. Ibid., p. 35.
6. Anthony G. Curwen. "A Decade of PRECIS, 1974-84." Journal of Librarianship 17:244-267 (October 1985).
7. H. Mary Micco, "A Comparison of Subject Access Systems in Medicine: LCSH, MESH, PRECIS," in Proceedings of the 48th ASIS Annual Meeting on October 20-24, 1985, ed. Carol A. Parkhurst (White Plains, New York: American Society for Information Science, 1985), p. 41-53.
8. Karen Markey. Subject Searching in Library Catalogs: Before and After the Introduction of Online Catalogs (Dublin, Ohio: Online Computer Library Center OCLC, 1986), p. 76-77.
9. Joseph R. Matthews and others, eds. Using Online Catalogs: A Nationwide Survey (New York: Neal-Schuman, 1983), p. 144.
10. Ray R. Larson, personal communication.
11. Marcia J. Bates. "Factors Affecting Subject Catalog Search Success." Journal of the American Society for Information Science 28:161-169 (May 1977).
12. Marcia J. Bates. "System Meets User: Problems in Matching Subject Search Terms." Information Processing and Management 13: 367-375 (1977).
13. Marcia J. Bates. "How to Use Controlled Vocabulary More Effectively in Online Searching." Online, in press.
14. Bates, "Factors," p. 166.
15. Marcia J. Bates. "Subject Access in Online Catalogs: A Design Model." Journal of the American Society for Information Science 37:357-376 (November 1986).
16. Anne B. Piternick. "Searching Vocabularies: A Developing Category of Online Search Tools." Online Review 8:441-449 (October 1984).
17. Carol A. Mandel. Multiple Thesauri in Online Library Bibliographic Systems: A Report Prepared for Library of Congress Processing Services. (Washington, DC: Library of Congress, Cataloging Distribution Service, 1987).
18. Bates, "Subject Access."
19. George W. Furnas and others, "Statistical Semantics: How Can a Computer Use What People Name Things to Guess What Things People Mean When They Name Things?" in Proceedings of the Human Factors in Computer Systems Conference, March 15-17, 1982, Gaithersburg, MD (New York: Association for Computing Machinery, 1982), p. 251.
20. Ibid., p. 252.
21. Oliver L. Lilley. "Evaluation of the Subject Catalog." American Documentation 5:42 (April 1954).
22. Bates, "Factors."
23. Bates, "System."
24. Tefko Saracevic and others. "A Study of Information Seeking and Retrieving I. Background and Methodology." Journal of the American Society for Information Science 39:161-176 (May 1988).
25. Tefko Saracevic and Paul Kantor. "A Study of Information Seeking and Retrieving. II. Users, Questions, and Effectiveness." Journal of the American Society for Information Science 39:177-196 (May 1988).
26. Tefko Saracevic and Paul Kantor. "A Study of Information Seeking and Retrieving. III. Searchers, Searches, and Overlap." Journal of the American Society for Information Science 39:197-216 (May 1988).
27. Saracevic, "II," p. 177.
28. Saracevic, "III," p. 203.
29. Ibid., p. 204.
31. Bates, "Subject Access," p. 360-361.
32. Karen Markey. "Integrating the Machine-Readable LCSH in Online Catalogs." Information Technology and Libraries (in press).
34. Bates, "Subject Access," p. 365.
35. Markey, "Integrating."
36. Bates, "Subject Access," p. 368-374.