Google Geek

Friday, October 06, 2006

Google Groups

Usenet groups, text-based discussion groups that cover literally hundreds of thousands of topics, have been around since long before the World Wide Web. Deja News used to be the repository of Usenet information until it sold off its archive to Google in early 2001. Google filled it out even further and relaunched it as Google Groups (http://groups.google.com). Its search interface is rather different from the Google Web Search, as all messages are divided into groups, and the groups themselves are divided into topics called hierarchies.

The Google Groups archive begins in 1981 and covers up to the present day. Just shy of 850 million messages are archived. As you might imagine, that's a pretty big archive, covering literally decades of discussion. Stuck in an ancient computer game? Need help with that sewing machine you bought in 1982? You might be able to find the answers here.

Google Groups also allows you to form your own ad hoc groups to collaborate on or discuss topics. See the Google Groups tour (http://groups.google.com/intl/en/googlegroups/tour/index.html) for instructions on how to create your own newsgroup. You have to first choose where you want your group to be categorized, which means understanding the hierarchy.

Ten Seconds of Hierarchy Funk

There are regional and smaller hierarchies, but Usenet relies on alt, biz, comp, humanities, misc, news, rec, sci, soc, and talk. Most Usenet groups are created through a voting process and are put under the hierarchy that's most applicable to the topic. But you can create a group that's available via Google Groups without any input.

Browsing Groups

From the main Google Groups page, you can browse through the list of groups by picking a hierarchy from the front page. You'll see there are subtopics, sub-subtopics, sub-sub-subtopics, and, well, you get the picture. For example, in the comp (computers) hierarchy, you'll find the subtopic comp.sys, or computer systems. Beneath that lie 75 groups and subtopics, including comp.sys.mac, a branch of the hierarchy devoted to the Macintosh computer system. There are 24 Mac subtopics, one of which is comp.sys.mac.hardware, which has, in turn, 3 groups beneath it. Once you've drilled down to the most specific group applicable to your interests, Google Groups presents the postings themselves, sorted in reverse chronological order.

This strategy works fine when you want to read a slow (i.e., containing little traffic) or moderated group, but when you want to read a busy, free-for-all group, you may wish to use the Google Groups Search engine. The search on the main page works much like the regular Google search, except for the Google Groups tab and the associated group and posting date that accompany each result.

The Advanced Groups Search (http://groups.google.com/advanced_group_search), however, looks much different. You can restrict your searches to a certain newsgroup or newsgroup topic. For example, you can restrict your search as broadly as the entire comp hierarchy (comp* would do it) or as narrowly as a single group such as comp.robotics.misc. You can restrict messages to subject and author, or restrict them by message ID.

Possibly the biggest difference between Google Groups and Google Web Search is the date searching. With Google Web Search, date searching is notoriously inexact (date refers to when a page was added to the index rather than when the page was created). Each Google Groups message is stamped with the day it was actually posted to the newsgroup. Thus, the date searches on Google Groups are accurate and indicative of when content was produced.

Google Groups Search Syntax

By default, Google Groups looks for your query keywords anywhere in the posting subject, body, group name, or author name.

And, thanks to some special syntax, you can do some precise searching if you know the magic incantations:

insubject:

Searches posting subjects for query words:

insubject:rocketry

group:

Restricts your search to a certain group or set of groups (topic). The * (asterisk) wildcard modifies a group: syntax to include everything beneath the specified group or topic. rec.humor* or rec.humor.* (effectively the same) find results in the group rec.humor, as well as rec.humor.funny, rec.humor.jewish, and so forth:

group:rec.humor*
group:alt*
group:comp.lang.perl.misc

author:

Specifies the author of a newsgroup post. This can be a full or partial name, or even an email address:

author:fred
author:"fred flintstone"
author:flintstone@bedrock.gov
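
To make the group: wildcard rule concrete, here is a minimal Python sketch of how a pattern such as rec.humor* could be checked against a group name. The function group_matches is hypothetical, and the sketch assumes matching happens on whole dotted components, so that rec.humor* and rec.humor.* behave alike as described above. It is an illustration of the rule, not Google's implementation.

```python
def group_matches(pattern: str, group_name: str) -> bool:
    """Return True if group_name falls under a group: pattern (hypothetical helper)."""
    if not pattern.endswith("*"):
        # No wildcard: only the exact group matches.
        return group_name == pattern
    # Treat 'rec.humor*' and 'rec.humor.*' alike: drop the '*' and any trailing dot.
    prefix = pattern[:-1].rstrip(".")
    # Assumption: the wildcard covers the group itself and everything beneath it.
    return group_name == prefix or group_name.startswith(prefix + ".")


for name in ["rec.humor", "rec.humor.funny", "rec.pets.cats"]:
    print(name, group_matches("rec.humor*", name))
```
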

Mixing Syntaxes in Google Groups

Google Groups is much more friendly to syntax mixing than Google Web Search. You can mix any two or more syntaxes in a Google Groups Search, as exemplified by the following typical searches:

intitle:literature group:humanities* author:john
intitle:hardware group:comp.sys.ibm* pda
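
If you build such queries in a script, a small helper keeps the pieces straight. The following Python sketch only assembles a query string from the insubject:, group:, and author: operators described earlier. The function name and its parameters are made up for illustration, and quoting multi-word values is an assumption based on the author:"fred flintstone" example.

```python
def build_groups_query(keywords=(), subject=None, group=None, author=None):
    """Compose a Google Groups query string (hypothetical helper)."""

    def term(operator, value):
        # Quote values containing whitespace so they stay attached to the operator.
        if any(ch.isspace() for ch in value):
            value = f'"{value}"'
        return f"{operator}:{value}"

    parts = []
    if subject:
        parts.append(term("insubject", subject))
    if group:
        parts.append(term("group", group))
    if author:
        parts.append(term("author", author))
    # Plain keywords go last, unchanged.
    parts.extend(keywords)
    return " ".join(parts)


print(build_groups_query(["pda"], subject="hardware", group="comp.sys.ibm*"))
```

The resulting string can be pasted into the Google Groups search box as-is.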

Some common search scenarios

There are several ways you can mine Google Groups for research information. Remember, though, to view any information you get here with a certain amount of skepticism. Usenet is just hundreds of thousands of people tossing around links; in that respect, it's just like the Web.

Tech support

Ever used Windows and discovered there's a program running you've never heard of? Uncomfortable, isn't it? If you're wondering if HIDSERV is something nefarious, Google Groups can tell you. Just search Google Groups for HIDSERV. You'll find that plenty of people had the same question before you did, and it's been answered.

I find that Google Groups is sometimes more useful than manufacturers' web sites. For example, I was trying to install a set of flight devices (a joystick, throttle, and rudder pedals) for a friend. The web site for the manufacturer couldn't help me figure out why they weren't working. I described the problem as best I could in a Google Groups search, using the names of the parts and the manufacturer's brand name, and, though it wasn't easy, I was able to find an answer.

Sometimes your problem isn't as serious but it's just as annoying. For example, you might be stuck in a computer game. If the game has been out for more than a few months, your answer is probably in Google Groups. If you want answers to an entire game, try the magic word walkthrough. So, if you're looking for a walkthrough for Quake II, try the search "quake ii" walkthrough. (You don't need to restrict your search to newsgroups; "walkthrough" is a word strongly associated with gamers.)

Finding commentary immediately after an event

With Google Groups, date searching is very precise (unlike date-searching Google's Web index), so it's an excellent way to get commentary during or immediately after events.

Barbra Streisand and James Brolin were married on July 1, 1998. Searching for "Barbra Streisand" "James Brolin" between June 30, 1998 and July 3, 1998 leads to over 48 results, including reprinted wire articles, links to news stories, and commentary from fans. Searching for "barbra streisand" "james brolin" without a date specification finds more than 1,800 results.

Usenet is also much older than the Web and is ideal for finding information about an event that occurred before the Web. Coca-Cola released New Coke in April 1985. You can find information about the release on the Web, of course, but finding contemporary commentary would be more difficult. After some playing around with the dates (just because it's been released doesn't mean it's in every store), I found plenty of commentary about New Coke in Google Groups by searching for the phrase "new coke" during the month of May 1985. Information included poll results, taste tests, and speculation on the new formula. Searching later in the summer yields information on Coke re-releasing old Coke under the name "Coca-Cola Classic."

Advanced Groups Search

The Advanced Groups Search is much like the Advanced Web Search and Advanced News Search.

Rather than fiddling with the special syntax detailed earlier, simply fill out the form, hit the Search button, and let Google Groups compose the query for you. You can restrict your search to a specific newsgroup or section of hierarchy (e.g., comp.os.*), a particular person, a particular language, or posts arriving in the past 24 hours, week, month, 3 months, 6 months, or year. You can even search for a particular message if you know the message ID. And since Usenet can be just as woolly as the Web, you might want to turn on SafeSearch.

Google News

At the time of this writing, Google News (http://news.google.com) culls over 4,500 news sources, from the Scotsman to the China Daily, from the New York Times to the Minneapolis Star Tribune.

The front page is updated algorithmically several times a day without any involvement by puny humans (aside, of course, from those writing the news in the first place). The "most relevant news" rises to the top.

Stories are organized into clusters, drawing together coverage and photographs from various news sources around the Web. Click the "all n related" link for a list of all stories falling within that cluster. Click "sort by date" to see how the story unfolded across sources over time.

All of this doesn't apply just to the front page, but to all the newspaper-like sections within: World, U.S., Business, Sci/Tech, Sports, Entertainment, and Health.

Google News Search Syntax

When you search Google News, the default is to search for your query keywords anywhere in the news article's headline, story text, source, or URL.

Google News supports the following special search syntax:

intitle:

Finds words in an article headline:

intitle:beckham

An allintitle: variation finds stories in which all the search keywords appear in an article headline, effectively the same as using intitle: before each keyword:

allintitle:miners strike benefits

intext:

Finds search terms in the body of a story:

intext:"crude oil"

An allintext: variation finds stories in which all the search keywords appear in article text, effectively the same as using intext: before each keyword:

allintext:US stocks rebound

inurl:

Looks for particular keywords in a news story's URL:

ipod inurl:reuters

source:

Finds articles from a particular source. Unfortunately, Google News does not offer a list of its over 4,500 sources, so you have to guess a little. Also, you need to replace any spaces in the source's name with underscore characters; e.g., the New York Times becomes new_york_times (case-insensitive):

miners source:international_herald_tribune
"international space station" source:new_york_times

location:

Filters articles from sources located in a particular country or state. For country names consisting of more than one word, replace any spaces with underscore characters; e.g., South Africa becomes south_africa (case-insensitive). In the case of state names, use official abbreviations such as ca for California and id for Idaho:

"organic farming" location:france
election 2004 location:ca
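
Two of the rules above are mechanical enough to script: turning a source or country name into the underscore form, and expanding allintitle: or allintext: into their per-keyword equivalents. The Python sketch below does only that string manipulation. Both function names are hypothetical, and lowercasing is a convenience, since the operators are described as case-insensitive.

```python
def operator_value(name: str) -> str:
    """Turn 'New York Times' into 'new_york_times' for source: or location:."""
    return "_".join(name.lower().split())


def expand_all(query: str) -> str:
    """Rewrite 'allintitle:a b c' as 'intitle:a intitle:b intitle:c'."""
    operator, _, rest = query.partition(":")
    if operator not in ("allintitle", "allintext") or not rest:
        # Anything else is returned untouched.
        return query
    single = operator[len("all"):]
    return " ".join(f"{single}:{word}" for word in rest.split())


print("miners source:" + operator_value("International Herald Tribune"))
print(expand_all("allintitle:miners strike benefits"))
```
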

Advanced News Search

Google Advanced News Search is much like the Advanced Web Search. It provides access to the Google News special syntax from the comfort of a web form. Notice the set of fields and pull-down menus associated with Date; use these to search for articles published in the last hour, day, week, month, or between any two particular days.

Fill in the fields, click the Search button, and notice how your query is represented in the search box on the results page.

Making the Most of Google News

The best thing about Google News is its clustering capability. On an ordinary news search engine, a breaking news story can overwhelm search results. For example, in late July 2002, a story broke that hormone replacement therapy might increase the risk of cancer. Suddenly, using a news search engine to find the phrase "breast cancer" was an exercise in futility, because dozens of stories around the same topic were clogging the results page.

This doesn't happen when you search the Google News engine because Google groups similar stories by topic. You'd find a large cluster of stories about hormone replacement therapy, but they'd be in one place, leaving you to find other news about breast cancer.

Some searches cluster easily; they're specialized or tend to spawn limited topics. But other queries (such as "George Bush") spawn lots of results and several different clusters. If you need to search for a famous name or a general topic (such as crime), narrow your search results in one of the following ways:

Add a topic modifier that will significantly narrow your search results, as in: "George Bush" environment crime arson.
Limit your search with one of the special syntaxes. For example: intitle:"George Bush".
Limit your search to a particular source. Be aware that while this works well for a major breaking news story, you might miss local stories. If you're searching for a major American story, CNN is a good choice (source:cnn). If the story you're researching is more international in origin, the BBC works well (source:bbc_news).

Receiving Google News Alerts

Google Alerts keep tabs on your Google News searches, notifying you if any news stories appear that match your search criteria. They're easy to set up, alter, and delete, and they're free.

Tuesday, October 03, 2006

Beyond searching

Google Scholar may provide an easy way to search. However, with the constantly increasing quantity of scholarly data, Google Scholar will soon be facing a new challenge, as will database providers and metasearch systems: the comprehensive presentation of search results to the user.

The assumption underlying the implementation of relevance ranking and its use as a sorting order is that end-users will not scroll down and scan large amounts of data. Therefore, the results that are most likely to suit their research needs should appear at the top of the list. However, this sorting order has several drawbacks. As mentioned earlier, users have different research needs, and an item that is most relevant to one user may be less relevant to another.

Another problem with presenting search results in any type of linear list is that sometimes there are a great many results. Some users, particularly those who are novices, may not know how to define their queries effectively; however, once the system analyzes the set of results and provides options to narrow down the list, such users can easily drill down to the relevant subset of results.

Several companies have developed technologies that enable sites to cluster search results and offer drill-down options to end-users. One such company is Vivísimo, whose technology can be seen on the Web site of the Institute of Physics (IOP).

I am looking for information about the sine-Gordon equation. When I search the IOP Web site, the traditional display provides a list of 95 articles. However, I can opt to see the list clustered. As explained on the IOP site, "when you cluster your search results, you will find them presented (unchanged) on screen alongside folders representing the clusters generated. The folders are sorted according to the number of search results in each, and according to the overall rank of the individual search results in the search engine's output". I can select any of these topic clusters, thus narrowing down my list of results, and I can drill down even further and see only the results for a particular subtopic. In our example, I quickly identify "soliton" as the topic of interest, thus decreasing the number of relevant results to 35; and if I am seeking information about magnetic fields, I can drill down further to the "magnetic fields" subtopic and see a list of four records. Note, however, that the Vivísimo IOP implementation clusters only the first 250 records.
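
The folder ordering in the IOP quotation can be illustrated with a short Python sketch. It assumes each result already carries a topic label (how Vivísimo derives the clusters is not described here), and it reads the quotation as "largest folder first, ties broken by the best-ranked result inside the folder", which is an interpretation. The sample titles and the function name are invented.

```python
from collections import defaultdict


def build_folders(results):
    """Group (rank, title, topic) tuples into folders; rank 1 is the best hit."""
    folders = defaultdict(list)
    for rank, title, topic in results:
        folders[topic].append((rank, title))
    for items in folders.values():
        items.sort()  # best-ranked result first within each folder
    # Largest folder first; ties broken by the best (lowest) rank in the folder.
    return sorted(folders.items(), key=lambda kv: (-len(kv[1]), kv[1][0][0]))


sample = [
    (1, "Paper A", "soliton"),
    (2, "Paper B", "quantum"),
    (3, "Paper C", "soliton"),
    (4, "Paper D", "lattice"),
]
for topic, items in build_folders(sample):
    print(topic, len(items))
```

Selecting a folder then amounts to showing only that folder's items, which is the drill-down step described above.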

Conclusions

Google Scholar is becoming the object of greater attention from libraries, patrons, and publishers, regardless of librarian approval. Depending on Google's plans, Google Scholar may turn into a core resource for researchers. Perhaps the library community should encourage patrons to use this search engine when appropriate and keep a watchful eye on the quality of the results.

Google's attentiveness to the library community, as evidenced by the rapid implementation of the OpenURL standard in Google Scholar, indicates that this service might well be evolving in the right direction. Nevertheless, it is not likely to replace metasearch systems in the short term. A locally controlled and branded system that enables librarians to offer accurate, up-to-date, subject-specific research data and to customize relevant services renders metasearch systems highly valuable to the scholarly community.

Adopting Google Scholar

The library community is divided between those who welcome Google Scholar and those who reject it. A recent study conducted at the University of California (UC) reveals the varying attitudes of librarians toward Google Scholar. Some believe that it is a great tool and promote it actively, whereas others do not use or recommend it and prefer their institutional databases, which they describe as "reliable" or "real". In many cases librarians use Google Scholar as an additional resource when they are looking for old materials, Web materials not found in the institution's databases, or materials that relate primarily to interdisciplinary topics. According to some, the fact that Google Scholar provides links to the UC SFX link server (UC-eLinks) makes it even more valuable. A number of librarians also recommend Google Scholar to non-affiliated users, who have no access to the institutional databases. Librarians who find Google Scholar useful are trying to figure out ways to instruct patrons about when it is appropriate to search in Google Scholar as opposed to the institution's databases.

Google Scholar is clearly gaining patrons' attention at university libraries, and librarians are responding accordingly. At UC some librarians include Google Scholar in the curriculum of classes that they teach or provide explanations to patrons at the reference desk. The Los Angeles campus (UCLA) Web site offers instruction on Google Scholar, search engines, databases, and the research process. By comparing searching in Google Scholar to searching in PsycINFO, the site enables users to figure out what they win and what they lose with each of these resources. In addition, the UCLA Web site provides a comprehensive explanation about using the school's SFX link server from Google Scholar.

Other institutions post pages with frequently asked questions about Google Scholar, such as the page on the Web site of the University of Nevada, Las Vegas, which states, among other things, that "While we encourage you to try Google Scholar, keep in mind that this software is 'in Beta.' Beta status indicates that Google Scholar is still in development, and you may therefore encounter some inconsistencies or peculiarities. You may wish to supplement your research by searching some of the many other databases found on the 'Find Articles and More' page".

Google Scholar: pros and cons

Google Scholar is easy to use. It has a familiar look and feel, and it is accessible from anywhere, including Internet cafés all over the globe. It is extremely fast, it covers a broad, heterogeneous range of information sources, and it does not require any specific query structure. Now let us look at some other aspects of Google Scholar that might shed light on its usefulness as a scholarly resource.

The major questions about Google Scholar relate to the scope, coverage, and accuracy of the content. Google Scholar does not disclose information about its content. At the SFX-MetaLib User Group (SMUG) meeting that took place in June 2005 at the University of Maryland, Anurag Acharya, the chief engineer of Google Scholar, talked about providing the "best possible scholarly search" and a "single place to find scholarly materials" covering "all research areas, all sources, all time". At the time of the writing of this article, the goal has not been fully achieved.

First, scholarly materials provided by many publishers, for example, Elsevier, the American Chemical Society and Emerald, are not yet included in Google Scholar, although the metadata describing some of these publishers' materials finds its way to Google Scholar via other channels, such as the National Library of Medicine's PubMed.

Second, the material that Google Scholar incorporates from a publisher does not always provide complete coverage. Furthermore, updates are not frequent enough to always include the most recent articles.

An enlightening review by Peter Jacso compares the coverage of Google Scholar and that of the original publisher's repository; the results of his comparison indicate that Google Scholar provides only partial coverage. Although the review was published in December 2004, the situation is similar almost a year later. For example, a search in Wiley InterScience for "tsunami" in the title field yields seven results, whereas a search in Google Scholar with the scope limited to Wiley InterScience yields only five results - articles published in 2005 do not appear. A search in Google Scholar for "antimatter", with the scope limited to the Institute of Physics, misses three articles (published in 1973, 1999, and 2003).

When a user knows exactly what he or she is looking for, the partial coverage problem is less serious because the person is aware that the item is missing and can check other databases, such as those that are targeted to the user's area and are more up to date. However, when users are looking for content without knowing which articles, books, or other materials have been published in that area, they might miss valuable information by relying solely on Google Scholar. For some users, such as undergraduates who are looking for any available material, such partial coverage matters less; for researchers, the unrecognized absence of relevant material can be critical.

Another issue worth noting is the definition of scholarly materials. Here, too, we are not sure how Google evaluates what it finds and what criteria it uses for categorizing materials as scholarly or not, except for the obvious cases in which it harvests publishers' sites.

Google Scholar offers a multidisciplinary repository. Unlike metasearch systems that by nature provide both the library and the end-user with tools to define the scope of a search and send a query to only the most relevant resources, Google Scholar, by default, uses its entire repository to provide results. Hence, a search for "mercury", for example, yields results relating to the planet, the chemical element, and the musician Freddie Mercury (though the latter does not appear at the top of the list). This approach clearly facilitates interdisciplinary research but can hamper the effort to focus on a specific discipline.

The problem of the search scope has resonated enough to bring about the introduction of a new feature in the Google Scholar advanced search interface - the option to limit the search to one or more broad subject areas: biology, life sciences, and environmental science; business, administration, finance, and economics; chemistry and materials science; engineering, computer science, and mathematics; medicine, pharmacology, and veterinary science; physics, astronomy, and planetary science; social sciences, arts, and humanities. However, libraries cannot control this list, and the issue of whether the results of a search limited to a specific subject area are, indeed, applicable to that area has yet to be examined. A quick test shows that among the articles that come up in a search for "Mars" in the subject area of social sciences, arts, and humanities is "History of water on Mars: a biological perspective", published by researchers from the Space Science Division, NASA Ames Research Center. We can safely conclude that this article is not related to the selected subject area. It seems that Google Scholar has developed automated procedures to categorize the materials that it harvests, but such procedures still fall short of the database providers' classification methods, which are based on careful, human processes.

One of the major contributions to the success of Google in general is the relevance-ranking feature. Usually people find what they are looking for on the first page of results, thanks to the PageRank algorithm that Google uses to evaluate each Web page prior to user queries and without any relation to them. This algorithm is based on the number of links that point to the page from other Web pages, the number of links that point to those other Web pages, and so on. For Google Scholar, the algorithm had to be changed because of the different nature of the data. According to the information on the Google Scholar Web site, "the relevance ranking takes into account the full text of each article as well as the article's author, the publication in which the article appeared and how often it has been cited in scholarly literature".
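
The link-based idea behind PageRank can be sketched in a few lines of Python. This is the simplified textbook iteration, not Google's production algorithm and not the modified ranking used by Google Scholar. The damping factor of 0.85, the iteration count, and the tiny example graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page c ends up on top because two pages link to it, and page a outranks page b because its only inbound link comes from the highly ranked c. That is the "and so on" in the description above: a link counts for more when the linking page is itself well linked.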

Here, however, we run into a few problems. First, because Google has not publicized its content or the manner in which it determines whether material is 'scholarly literature', we have no way of knowing whether the number of citations is complete and accurate. Furthermore, as Google does not always identify duplicates (probably because of the heterogeneous nature of the metadata that it discovers while crawling the Web), the number of citations may not be realistic. For example, when we search in Google Scholar for the article "Library portals: toward the semantic Web", Google Scholar shows that the article has been cited six times; nevertheless, when we click the 'Cited by 6' link and look at the citations carefully, we can see that one publication appears twice, as both the first and sixth citations. Moreover, at least two other known citations are missing altogether. Yet Google Scholar uses citations to determine relevance ranking.
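
The effect of duplicates on a citation count is easy to demonstrate. The Python sketch below collapses citing records whose titles differ only in case or punctuation before counting them. It is a deliberately crude illustration: real de-duplication would also compare authors, years, and venues, the titles are invented, and nothing here reflects how Google Scholar actually processes its data.

```python
import re


def normalize(title: str) -> str:
    """Lowercase a title and reduce punctuation and spacing to single spaces."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def count_unique_citations(citing_titles):
    """Count citing records after collapsing near-identical titles."""
    return len({normalize(title) for title in citing_titles})


citing = [
    "Portals and the Semantic Web",
    "A Survey of Library Systems",
    "portals and the semantic web.",  # same work, harvested a second time
]
print(len(citing), count_unique_citations(citing))
```
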

Whether systems that enable searches across scholarly materials should display the results of users' queries by relevance is not a simple question to answer. Relevance, in at least some cases, depends on context. Relevant to whom? For what purpose? Does the same relevance apply to an undergraduate who is looking for material for an introductory course in physics and a scientist who is searching for recent publications related to current research? The student might need a well known article that is not new, but the scientist is almost certainly not looking for that article. Furthermore, the usefulness of an item depends on the discipline of the researcher; for example, the articles that come up in a search for "plague" will differ in their relevance to a scholar of twentieth-century French literature and to an epidemiologist.

Roy Tennant offers a noteworthy example in his presentation "Is MetaSearch Dead?". He searched for "tsunami" in Google Scholar, Google, and the National Science Digital Library (NSDL). The first page of results in Google Scholar yielded no items with general information that an undergraduate would find useful. In Google, the first page included three results with useful scientific information, seven relief effort sites, and at least seven sponsored links (advertisements). But the first page of results at the NSDL listed 20 sites with useful scientific information. Perhaps these sites are hiding somewhere in the Google Scholar result list, but it is doubtful that any user will be able to find them among the tens of thousands of results.

Interestingly, most bibliographic databases do not return results by relevance; such databases typically list results by date in descending order and enable the user to re-sort them by other criteria, such as author and title. Metasearch systems retrieve the results from the databases in the order set by each database and sometimes also provide options for other modes of display. For example, MetaLib from Ex Libris displays the results in the original order dictated by the database and also as one merged, de-duplicated list sorted by relevance. The end-user can re-sort the merged list by author, title, and date.
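
A merged, de-duplicated list that the user can re-sort can be sketched as follows in Python. The record fields, the duplicate test (same title and year), and the function names are assumptions for illustration. The sketch says nothing about how MetaLib itself merges or ranks records, and it leaves relevance sorting out entirely.

```python
def merge_results(result_lists):
    """Merge per-database result lists, keeping the first copy of each record."""
    merged = {}
    for results in result_lists:
        for record in results:
            # Assumption: records with the same title and year are duplicates.
            key = (record["title"].strip().lower(), record["year"])
            merged.setdefault(key, record)
    return list(merged.values())


def resort(records, field, descending=False):
    """Re-sort the merged list by 'author', 'title', or 'year'."""
    return sorted(records, key=lambda record: record[field], reverse=descending)


database_a = [{"title": "Solitons", "author": "Lee", "year": 2001}]
database_b = [
    {"title": "solitons ", "author": "Lee", "year": 2001},
    {"title": "Plasma", "author": "Abe", "year": 2004},
]
merged = merge_results([database_a, database_b])
for record in resort(merged, "year", descending=True):
    print(record["year"], record["title"])
```
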

Google Scholar's choice of sorting criteria used for the display of scholarly materials represents a significant potential for power. People who are used to finding what they are looking for on the first page in Google are likely to adopt the same behaviour when using Google Scholar; thus highly cited items will gain more citations and will continue to appear at the top of the page. It is not obvious that this method of displaying results, the only one that Google Scholar provides, is indeed appropriate for scientific research.

One of the greatest advantages of Google Scholar, inherited from Google, is the simple interface, in terms of both design and functionality. Extremely intuitive, it is also available from any computer, with any browser. On the one hand, this interface is convenient for end-users, but, on the other, it does not allow for integration within a virtual library environment. Libraries typically want to provide their patrons with a complete user experience, encompassing content, design, and services, and they manage to do that quite successfully with their metasearch systems. With these systems, not only can libraries customize the user interface to create their own look and feel (typically for institutional branding), but they can also integrate their metasearch systems with their authentication environment, course management systems, and institutional portals; they have control over the resources that they offer, the categorization of those resources, the terminology, the display options, and the services that they provide for the end-users. Such services include a link to the library's holdings - be they electronic or print, local or remote; links to other relevant resources; functions that enable users to download records in the appropriate format, save and send citations, define alerts, create lists of favourite resources, and more.

Google Scholar, however, does not support integration in the virtual library environment. The Georgia State University library site, for example, makes an effort to introduce Google Scholar to its patrons, but when a patron clicks the link to Google Scholar, a new window opens without any university branding - the same Google Scholar page that users at any other library see.

Of much concern at the time that Google Scholar was launched was the lack of library control over the link to the electronic copy that Google Scholar provided for citations. Google Scholar did not address the 'appropriate copy' problem, despite the generally accepted solution offered by the OpenURL framework. As Herbert Van de Sompel, inventor of the OpenURL framework, explains, "This problem refers to the fact that such linking frameworks fail to provide links that lead from a citation of a journal article to the appropriate full-text copy of that article. A full-text link typically leads to a publisher-defined default copy of the article, which usually resides in the publisher's repository. However, access to the copy of the article that is appropriate in the context of a certain user may very well require the provision of an alternative link".

As a result of these concerns, Google Scholar was quick to adopt the OpenURL standard. Following a short pilot project with selected libraries, Google Scholar became officially OpenURL enabled in May 2005. If a library opts to take advantage of this compliance, Google Scholar provides library-defined links to the user's institutional link server, for example, SFX, for many of the displayed citations (as long as specific metadata elements, such as ISSN and DOI, are available). On the basis of the user's IP or the affiliation preferences that he or she has set, Google Scholar identifies the user as belonging to a specific institution. The provision of a link to the institutional link server puts the control back in the hands of the librarians and allows the users of Google Scholar to take advantage of library holdings and services.
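
For readers unfamiliar with OpenURL, the Python sketch below assembles a link to an institutional link server from citation metadata, using the key/encoded-value form of the OpenURL 1.0 standard (ANSI/NISO Z39.88-2004). The resolver address, ISSN, and DOI are placeholders, and the function name is hypothetical. The exact links that Google Scholar generates are not documented in this article, so this shows the general shape of an OpenURL rather than Google's output.

```python
from urllib.parse import urlencode


def openurl_link(resolver_base, issn=None, doi=None, volume=None, start_page=None):
    """Build an OpenURL 1.0 (KEV) link for a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
        ("rft.genre", "article"),
    ]
    # Only metadata elements that are actually available are sent.
    if issn:
        params.append(("rft.issn", issn))
    if doi:
        params.append(("rft_id", f"info:doi/{doi}"))
    if volume:
        params.append(("rft.volume", str(volume)))
    if start_page:
        params.append(("rft.spage", str(start_page)))
    return f"{resolver_base}?{urlencode(params)}"


# Placeholder resolver address and identifiers, not real ones.
print(openurl_link("https://resolver.example.edu/sfx", issn="1234-5679", doi="10.1000/example"))
```

The link server, not the search engine, then decides which copy or service to offer the user, which is what returns control to the library.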

Under the assumption that users are typically most interested in electronic full text, Google Scholar has been designed to display a link to the institutional link server in a prominent place - next to the title - when the electronic full text is available, and when it is not, the link is displayed with the other links, underneath the citation. For Google Scholar to be able to alter the display of the link according to the availability of the full text, libraries must provide Google Scholar with the details of their electronic holdings. SFX, the link server from Ex Libris, automates the provision of holdings information to Google Scholar so that this task does not become a burden on the library staff.

Many librarians, however, did not readily accept this requirement. First, it contradicts one of the fundamental concepts of the OpenURL framework: the library should have full control over the user experience regarding the delivery of services. Second, because Google is a commercial company, some librarians are concerned that providing Google Scholar with holdings information may serve Google for matters other than the provision of links and hence does not comply with their mission as educational or research institutions that are commercially neutral. And third, this move requires that libraries maintain the information in a form that Google Scholar can harvest. In his presentation at the SMUG meeting, Acharya offered compelling arguments for providing holdings data. He highlighted the benefits of having the links in Google Scholar and explained the Google Scholar philosophy of informing the user in advance of whether the desired service, in this case the link to the full text, is available. Assuring libraries that they have a partner they can talk to, he emphasized the need to "step out of the mutual comfort zones" and work together.

Finally we come to a question that continues to puzzle the library community: what is the business model that Google has adopted for Google Scholar? At the time of the writing of this article, the Google Scholar site was not displaying advertisements. However, Google Scholar was still in the beta phase. If this policy changes, libraries may reconsider providing their holdings to Google Scholar and promoting its use in their institution. As with many other questions concerning Google Scholar, we can only wait and see what happens.

Google Scholar Versus Metasearch Systems

At the end of 2004, Google launched the beta version of a new service, Google Scholar, which provides a single repository of scholarly information for researchers. Will this service replace metasearch systems?

Metasearch systems are based on just-in-time processing, whereas Google Scholar, like other federated searching systems, is based on just-in-case processing. This underlying technology, along with Google Scholar's exceptional capabilities, accords Google Scholar a unique position among other scholarly resources. However, a year after its beta release, Google Scholar is still facing a number of challenges that cause librarians to question its value for scholarly research. Nevertheless, it has become popular among researchers, and the library community is looking for ways to provide patrons with guidelines for the most beneficial manner of using this new resource.

Metasearch systems have several advantages over Google Scholar. We anticipate that in the foreseeable future, libraries will continue to provide access to their electronic collections via their branded, controlled metasearch system.

Keywords

Metasearch, federated search, CrossRef CrossSearch, relevance ranking, Google Scholar, search engine, clustering search engine

Introduction

Google as a Web search engine has undoubtedly had a great impact on all those who search for information on the Web. The instant response, huge repositories, sophisticated search mechanism and relevance-ranking feature have combined to make Google the most popular Web search engine.

In late 2004, Google launched several exciting products, one of which is a beta version of Google Scholar. Aiming to provide a single repository for scholarly information, Google Scholar enables users to search for peer-reviewed papers, theses, books, preprints, abstracts, and technical reports in many academic areas. Furthermore, according to information released by Google, Google Scholar arranges results by relevance, taking into account the number of times that the item has been cited in scholarly literature, as well as other criteria. Equipped with this unique ranking process, unparalleled hardware resources, sophisticated crawling techniques, and access to published materials, Google is positioning Google Scholar to be an essential resource for the scholarly environment. In the not too distant future, Google is likely to be facing rivals such as MSN and Yahoo!, who may offer similar products.

Still at the beta stage a year after its initial launch, Google Scholar has stimulated lively debate in the library community. Of particular interest to many is the question of whether Google Scholar is a potential competitor of metasearch systems and, if so, whether it will replace them or coexist with them as yet another channel to scholarly information.

Metasearching and federated searching

Before evaluating Google Scholar and its impact on the scholarly environment, let us examine the historical roots of the methodologies underlying systems such as Google Scholar.

We will start by clarifying the terms 'metasearch system' and 'federated search system' as used in this paper. These terms are frequently interchanged, but for our purposes, we would like to draw a distinction.

Metasearching, also known as integrated searching, simultaneous searching, cross-database searching, parallel searching, and broadcast searching, is a process in which a user submits a query to numerous information resources simultaneously. The resources can be heterogeneous in many respects: their location, the format of the information that they offer, the technologies on which they draw, the types of materials that they contain, and more. The user's query is broadcast to each resource, and results are returned to the user.

The development of software products that offer metasearching relies on the fact that each information resource has its own search engine. The metasearch system transmits a user's query to that search engine and directs it to perform the actual search. Upon receiving the results of the search, the metasearch system displays them to the user. This process involves, first, the adaptation of the query's format to the specific requirements of the search engine at the target's end, and next, the conversion of the results to a unified format. The unified format later enables the metasearch system to process the results further - including displaying them in a consistent manner, merging them, and de-duplicating them.

We can describe metasearching as just-in-time processing. That is, instead of pre-processing the data, the metasearch system processes it only when the user launches a query.
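The just-in-time flow described above (send the query to each resource's own engine, convert the results to a unified format, then merge and de-duplicate) can be sketched as follows. This is an illustration under stated assumptions, not any vendor's implementation: the two resources are stand-ins that return canned records, the unified format is a simple dictionary, and the resources are queried one after another where a real metasearch system would broadcast in parallel.

```python
def metasearch(query, resources):
    """Send a query to every resource and merge the normalized results.

    resources maps a name to a (search, normalize) pair: search runs the
    resource's own engine, and normalize converts one native record into
    the unified {"title": ...} format.
    """
    merged, seen = [], set()
    for name, (search, normalize) in resources.items():
        for raw in search(query):            # the remote engine does the work
            record = normalize(raw)
            record["source"] = name
            key = record["title"].strip().lower()
            if key not in seen:              # de-duplicate across resources
                seen.add(key)
                merged.append(record)
    return merged

# Two stand-in resources with different native record shapes (canned data).
def catalog(query):
    return [("Metasearch Basics", 2004)]

def journals(query):
    return [{"ti": "metasearch basics"}, {"ti": "Federated Search"}]

resources = {
    "catalog": (catalog, lambda rec: {"title": rec[0]}),
    "journals": (journals, lambda rec: {"title": rec["ti"]}),
}
results = metasearch("metasearch", resources)
print([r["title"] for r in results])
```

Nothing is stored between queries: the only thing the sketch knows about each resource is how to search it and how to read its results.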

Metasearch systems, therefore, hold information about how a resource can be searched and how results can be extracted from it, but they do not contain any of the data that is stored in any of the resources that they can access.

In federated searching, a wealth of information is incorporated into a single repository that can be searched. In this model, the information is processed prior to the user's search. From the end-user's point of view, federated searching and metasearching may seem similar, because both provide a single interface to multiple resources, but they actually differ in many respects. The pre-processing taking place in a federated searching environment, which we can describe as just-in-case processing, offers new opportunities regarding search methodologies and the presentation of results. For example, a ranking algorithm can be applied to each data element stored in the repository, unrelated to any future user query. Such an algorithm can take into account the number of times that an article has been cited, the number of articles that the author has published, the number of times that a book has been borrowed, a journal's impact factor, and other parameters. A federated searching system can use the calculated rank to better evaluate the relevance of the specific item once it has been retrieved as the result of a query.
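A small sketch may help contrast this with the just-in-time model. Here every record is scored once, before any query arrives, and the stored score is reused at search time. The scoring formula (citations plus ten times the impact factor) and the sample records are invented purely for illustration.

```python
def precompute_ranks(repository):
    """Just-in-case step: score every record before any query arrives."""
    for record in repository:
        # Invented weighting, for illustration only.
        record["rank"] = record["citations"] + 10 * record["impact_factor"]
    return repository

def search(repository, term):
    """Query step: match on title, then order hits by the stored rank."""
    hits = [r for r in repository if term.lower() in r["title"].lower()]
    return sorted(hits, key=lambda r: r["rank"], reverse=True)

repository = precompute_ranks([
    {"title": "Linking in libraries", "citations": 3, "impact_factor": 1.0},
    {"title": "OpenURL linking", "citations": 40, "impact_factor": 2.0},
    {"title": "Metasearch design", "citations": 12, "impact_factor": 0.5},
])
titles = [r["title"] for r in search(repository, "linking")]
print(titles)
```

The ranking work is paid for up front, which is exactly what a metasearch system cannot do, because it never holds the records itself.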

Looking back a few years, we can see that the need for a single search interface to multiple resources arose some time ago, and, in fact, metasearching and federated searching have been available for quite some time. Such systems originated in a variety of environments; for example, Elsevier, a publisher offering numerous journals, created a federated search mechanism enabling its users to search all its e-journals through its ScienceDirect service. As Elsevier acquired other publishers, it was able to add their journals to the same platform.

Database vendors developed similar mechanisms. For example, Ovid provides a single interface to a few hundred databases that it publishes, and still retains them as separate databases. Commercial organizations were not the only ones that addressed the need for a single search interface; several large research institutions created a local environment based on federation. For example, the Los Alamos National Laboratory and the OhioLink consortium in the United States, the University of Toronto in Canada, the Technical Knowledge Center of Denmark (DTV), and the Max Planck Society in Germany all offer large, diverse collections of e-journals that they store locally. These institutions have implemented federated searching to provide a single search interface across their electronic collections.

However, not all organizations have the resources to adopt this just-in-case approach. Furthermore, with the rapid increase in the number of heterogeneous resources that institutions offer their users, a single federated searching system can serve only as a partial solution.

Library system vendors took a major step toward metasearching when they implemented the Z39.50 search-and-retrieve protocol, which enables them to provide access to library catalogues. Despite the wide adoption of this protocol, this solution could not scale up to provide a single access point to numerous resources. Hence, we saw the emergence of dedicated metasearch systems as we know them today.

The market's quick acceptance of metasearch systems indicates that libraries do indeed have a need that these systems can fulfil. For example, well over 500 institutions have acquired the Ex Libris MetaLib system since 2001, and many other such metasearch systems are offered in the marketplace. The ability to provide a single, friendly interface to multiple resources enables libraries to better address the changing expectations of their users, users who in the meantime have become accustomed to Google and Amazon.

Libraries have not only adopted metasearch systems at a rapid pace, but they have also advocated the development of new standards related to the metasearch process and are sharing their concerns with information providers and metasearch system vendors about the accuracy of searches and the burden that remote searches place on target resources. The active involvement of information providers kicked off the NISO Metasearch Initiative, whose aim is to provide the industry with a set of standards that will facilitate and optimize metasearching. This NISO initiative has been the focus of much discussion in the last couple of years, and apparently numerous stakeholders - publishers, librarians, and metasearch system vendors - agree on the value of formulating standards in this area.

Of particular interest to the providers of metasearch systems are the Semantic Web developments spearheaded by Tim Berners-Lee and the World Wide Web Consortium (W3C). A Semantic Web approach would facilitate the interaction between a metasearch system and any number of target resources without requiring prior programming for each target resource. The ideal solution is for the metasearch system to receive resource-specific information at the time of the actual interaction and formulate the flow of the interaction on the basis of this information.

Sunday, October 01, 2006

Compare Google and Yahoo! Search Results

Pit Google and Yahoo! against each other and find more search results in the process.

If you've ever searched for the same phrase at both Google and Yahoo!, you've probably noticed that the results can be surprisingly different. That's because Google and Yahoo! have different ways of determining which sites are relevant for a particular phrase. Though both companies keep the exact way they rank results a secret (to thwart people who would take advantage of it), both Yahoo! and Google provide some clues about what goes into their ranking systems.

At the heart of Google's ranking system is a proprietary method it calls PageRank, and Google doesn't give detailed information about it. But Google does say this:

Google's order of results is automatically determined by more than 100 factors, including our PageRank algorithm.

Here's the official word from Yahoo!:

Yahoo! Search ranks results according to their relevance to a particular query by analyzing the web page text, title, and description accuracy as well as its source, associated links, and other unique document characteristics.

Though we might never know exactly why results are different between the two search engines, at least we can have some fun spotting the differences and end up with more search results than either site would have offered on its own.

One way to compare results is to simply open each site in a separate browser window and manually scan for differences. If you search for your favorite dog breed (say, "australian shepherd"), you'll find that the top few sites are the same across both Yahoo! and Google, but the two search engines quickly diverge into different results. At the time of this writing, both sites estimate exactly 1,030,000 total results for this particular query, but for other queries, estimated result counts might be another way to spot differences between the sites.

Viewing both sets of results in different windows is a bit tedious, and a clever Norwegian developer named Asgeir S. Nilsen has made the task easier, at a site called Twingine.

Twingine

The Twingine site (http://twingine.com) contains a blank search form into which you can type any search query. When you click Search, the site brings up the results pages for that query from both Yahoo! and Google, side by side. To be fair, the sides on which Google and Yahoo! appear change at random, so people who prefer one side of the screen to the other won't be biased. Plugging "australian shepherd" into Twingine yields both sets of results in a single window.

Clicking Next or Previous in the top frame at Twingine takes you to the next or previous page in the search results at both sites.

Surfing the pages in the search results at Twingine can be a bit tricky. You'll probably want to open linked search results in a new window or tab, so that you can keep your place in the search results at both Yahoo! and Google. You can open links in a new window by right-clicking the link (Ctrl-click on a Mac) and choosing Open Link in New Window from the menu. You can also set your search preference at either search engine to automatically open links in a new window when you click a search result.

Yahoo! Versus Google Diagram

Another site, developed by Christian Langreiter, adds a bit of analysis to the different sets of search results between Yahoo! and Google. If you have Flash installed, you can type a search query into the form at http://www.langreiter.com/exec/yahoo-vs-google.html, and the site fetches the search results from both engines in the background using their open APIs.

Each blue or white dot in the diagram represents a search result URL, and the position of the dot represents the ranking. The dots on the far left are the top search results, and the further right you go, the further down you go in the search results. The blue lines represent the same URL, so you can see exactly where Google and Yahoo! line up.

In the diagram for "australian shepherd," you can see that the top search result is the same URL at both engines, but the lines aren't as evenly matched further down in the results. As you hover over each dot, you see the URL, which you can click to visit that particular search result.

The white dots in the diagram represent URLs that one search engine has in its results and the other does not. And as this diagram demonstrates, neither search engine has a monopoly on matching pages, nor does either engine's index have every page on a particular topic.
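The comparison behind such a diagram is easy to sketch. Given two ranked lists of result URLs, the function below pairs up the URLs that both engines returned (the connected dots) and lists the ones that only one engine returned (the white dots). The URLs are placeholders, and this is not the code behind the site itself.

```python
def compare_rankings(first, second):
    """Match URLs across two ranked result lists.

    Returns (shared, only_first, only_second); each shared entry is
    (url, rank_in_first, rank_in_second), with ranks starting at 1.
    """
    rank_in_second = {url: i for i, url in enumerate(second, start=1)}
    in_first = set(first)
    shared = [(url, i, rank_in_second[url])
              for i, url in enumerate(first, start=1) if url in rank_in_second]
    only_first = [url for url in first if url not in rank_in_second]
    only_second = [url for url in second if url not in in_first]
    return shared, only_first, only_second

google = ["a.example", "b.example", "c.example"]
yahoo = ["a.example", "c.example", "d.example"]
shared, only_google, only_yahoo = compare_rankings(google, yahoo)
print(shared, only_google, only_yahoo)
```

Adding the two "only" lists to the shared list gives you the combined set of results, which is larger than either engine's list alone.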

If you already do serious research with search engines, you're very aware that having several search tools at your disposal is better than relying on one. And with the methods mentioned in this blog, you can compare and contrast the tools, giving you more results to choose from.

Your Own Google Search Form

Build your own personal, task-specific Google search form.

If you want to do a simple search with Google, you need only the standard Simple Search form (the Google home page). But if you want to craft specific Google searches to use on a regular basis or provide for others, you can simply put together your own personalized search form.

Start with a garden-variety Google search form; something like this will do nicely:
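As a rough sketch, the snippet below uses a little Python to generate the markup, so the same form can be reproduced for different tasks. The action address and the q field name are assumptions based on Google's public search URLs, and the helper name is made up for this example.

```python
from html import escape

def render_search_form(action="http://www.google.com/search",
                       label="Search Google"):
    """Return HTML for a bare-bones form that submits the q field via GET."""
    return (
        f'<form method="get" action="{escape(action)}">\n'
        '  <input type="text" name="q" size="31" maxlength="255" value="">\n'
        f'  <input type="submit" value="{escape(label)}">\n'
        '</form>'
    )

form_html = render_search_form()
print(form_html)
```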

This is a very simple search form. It takes your query and sends it directly to Google, adding nothing to it. But you can embed some variables to alter your search as needed. You can do this in two ways: via hidden variables or by adding more input to your form.

Hidden Variables

As long as you know how to identify a search option in Google, you can add it to your search form via a hidden variable. The fact it's hidden just means that form users can't alter it. They can't even see it unless they look at the source code. Let's look at a few examples.

File Type

As the name suggests, File Type specifies that your results are filtered by a particular file type (e.g., Word .doc, Adobe .pdf, PowerPoint .ppt, plain text .txt). Add a PowerPoint file type filter, for example, to your search form, like so:

Site Search

Narrows your search to specific sites. While a suffix such as .com will work just fine, something more fine-grained such as the example.com domain is probably better suited:

URL Component

Specifies a particular path component to look for in URLs. This can include a domain name but doesn't have to. The following tries to tease out documentation in your result set:

Date Range

Narrows your search to pages indexed within the stated number of months. Acceptable values are between 1 and 12. Restricting your results to items indexed only within the last seven months is just a matter of adding:

Number of Results

Indicates the number of results you'd like to appear on each page, specified as a value of num between 1 and 100; the following asks for 50 per page:

What would you use this for? If you regularly look for an easy way to create a search engine that finds certain file types in a certain place, this works really well.
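The hidden variables described above might be rendered along these lines. Treat the parameter names (as_filetype, as_sitesearch, hq, as_qdr, num) and the sample values as assumptions rather than a confirmed list; they are the names commonly seen in Google search URLs, and Google may change or ignore them.

```python
from html import escape

def hidden_inputs(options):
    """Render one hidden <input> element per search option."""
    return "\n".join(
        f'<input type="hidden" name="{escape(name)}" value="{escape(str(value))}">'
        for name, value in options.items()
    )

# Assumed parameter names and illustrative values, one per option above.
options = {
    "as_filetype": "ppt",            # File Type: PowerPoint only
    "as_sitesearch": "example.com",  # Site Search
    "hq": "inurl:docs",              # URL Component
    "as_qdr": "m7",                  # Date Range: last seven months
    "num": 50,                       # Number of Results per page
}
markup = hidden_inputs(options)
print(markup)
```

Paste the generated lines between the opening and closing form tags; users will see only the query box and the button.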

Creating Your Own Google Form

Some variables work well hidden; however, for other options, you can give your form users visible options to provide more flexibility.

Let's go back to the previous example. You want to let your users search for PDF files, but you also want them to be able to search for Excel and Microsoft Word files. In addition, you want them to be able to search not only oreilly.com, but also the State of California or the Library of Congress web sites. Obviously, there are various ways to design this form; this example uses a couple of simple pull-down menus.
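One possible shape for such a form, again generated with a short Python sketch. The as_filetype and as_sitesearch field names are assumptions, and ca.gov and loc.gov are used here as stand-ins for the State of California and Library of Congress sites.

```python
from html import escape

def select_menu(name, choices):
    """Render a pull-down menu from (value, label) pairs."""
    options = "\n".join(
        f'    <option value="{escape(value)}">{escape(label)}</option>'
        for value, label in choices
    )
    return f'  <select name="{escape(name)}">\n{options}\n  </select>'

def render_form(file_types, sites):
    """Combine a query box with file-type and site pull-downs."""
    return "\n".join([
        '<form method="get" action="http://www.google.com/search">',
        '  <input type="text" name="q" size="31" value="">',
        select_menu("as_filetype", file_types),
        select_menu("as_sitesearch", sites),
        '  <input type="submit" value="Search">',
        '</form>',
    ])

file_types = [("pdf", "PDF"), ("xls", "Excel"), ("doc", "Microsoft Word")]
sites = [("oreilly.com", "O'Reilly"), ("ca.gov", "State of California"),
         ("loc.gov", "Library of Congress")]
form = render_form(file_types, sites)
print(form)
```

The only difference from the hidden-variable approach is that the same fields are now visible, so the person using the form picks the value.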

FaganFinder (http://www.faganfinder.com/engines/google.shtml) is a wonderful example of a thoroughly customized form.

If you find yourself running fairly complex queries on a regular basis, you can speed things up by setting a few options in a custom form. And chances are good that if you find the convenience of a custom form helpful, others will too. So, making your custom form available on your web site is a good way to let others share in your productivity.

Cover Your Bases

Try all possible combinations of your search keywords at once, and find related keywords with Google Sets.

Imagine you have a set of query words but are not sure that they're the right set; you certainly don't want to miss any results by picking the wrong combination of keywords, including or excluding the wrong word. But the thought of typing a dozen-plus permutations of keywords has your carpal tunnel flaring up in horror. With some existing tools, you can fine-tune your Google queries by playing with word sets, leading you down paths you might not have discovered.

Search Grid (http://blog.outer-court.com/search-grid), by German programmer Philipp Lenssen, lets you explore a wide range of Google search results by automatically searching for multiple combinations of keywords you specify. This gives you a quick overview of paths you can follow for a given set of keywords. You might, for example, put catsup, mustard, and pickles on the x-axis and relish, onions, and tomatoes on the y-axis.

Note that you get nothing but the first result; this is not the tool to use if you want an in-depth search of each query. Instead, it's meant to give you a bird's-eye view of how the different combinations of search words impact the query.

There's also a version of Search Grid that's been integrated into a web tool called FindForward (http://www.findforward.com/?t=grid), which gives you screenshots of some Google search results. FindForward requires less typing: enter two to five words for which you want to check possible permutations. You get a large grid of search results, with screenshots available for some of the pages.

Note that this grid searches each of your keywords individually (one square for mustard, one for pickles, one for relish) and searches every possible combination of two words (pickles relish, pickles mustard, mustard relish, etc.), but it doesn't search for three- and four-word permutations. In other words, this tool doesn't find every last possible permutation of your search. Again, it's an overview that gives you an idea of how different word combinations can affect your search, and it is not meant to be exhaustive.
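If you'd rather generate the same set of queries yourself, the singles-plus-pairs pattern is a few lines of Python. This only mirrors the behavior described above; it is not FindForward's own code.

```python
from itertools import combinations

def grid_queries(keywords):
    """Return every single keyword plus every unordered pair, as query strings."""
    singles = list(keywords)
    pairs = [" ".join(pair) for pair in combinations(keywords, 2)]
    return singles + pairs

queries = grid_queries(["mustard", "pickles", "relish"])
print(queries)
```

With the maximum of five words you end up with 15 queries: 5 singles and 10 pairs.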

But why limit yourself to keyword sets that you can dream up? Google has its own tool in development to expand your keyword vocabulary based on a small set of words. Google Sets (http://labs.google.com/sets) allows you to enter several keywords and have Google predict similar keywords in a large or small set. For example, plug catsup, mustard, and pickles into the form and click Large Set. You should see a list of 25 or more words that run the condiment gamut from Lettuce to Black Olive.

Find Directories of Information

Use Google to find directories, link lists, and other collections of information.

Sometimes you're more interested in large information collections than scouring for specific bits and bobs. You could always take a stroll through the Google Directory (http://directory.google.com) to see what's available, but sometimes a topic-specific directory is what you need.

Using Google, there are a couple of different ways to find directories, link lists, and other information collections from across the Web. The first uses Google's full-word wildcards and the intitle: syntax. The second is a judicious use of particular keywords.

Title Tags and Wildcards

Pick something you'd like to find collections of information about. We'll use "trees" as our example. The first thing we look for is any page with the words "directory" and "trees" in its title. In fact, we build in a little buffering for words that might appear between the two using a couple of full-word wildcards (* characters). The resultant query looks something like this:

intitle:"directory * * trees"

This query finds "directories of evergreen trees," "South African trees," and of course "directories containing simply trees."

What if you want to take things up a notch, taxonomically speaking, and find directories of botanical information? Use a combination of intitle: and keywords, like so:

botany intitle:"directory of"

and you get almost 10,000 results. Changing the tenor of the information might be a matter of restricting results to those coming from academic institutions. Appending an edu site specification brings you to:

botany intitle:"directory of" site:edu

This gets you around 150 results, a mixture of resource directories, and, unsurprisingly, directories of university professors.

Mixing these syntaxes works rather well when searching for something that might also be an offline print resource. For example:

cars intitle:"encyclopedia of"

This query pulls in results from Amazon.com and other sites that sell car encyclopedias. Filter out some of the more obvious book finds by tweaking the query slightly:

cars intitle:"encyclopedia of" -site:amazon.com
-inurl:book -inurl:products

The query specifies that search results should not come from Amazon.com and should not have the word "products" or "book" in the URL, which eliminates a fair amount of online stores. For some interesting finds, play with this query by changing the word "cars" to whatever you like.
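If you find yourself swapping topics in and out of queries like these, a tiny helper can assemble them for you. The function name and its arguments are made up for this sketch; the operators it emits are the ones used in the queries above.

```python
def build_query(keywords, intitle=None, site=None,
                exclude_sites=(), exclude_inurl=()):
    """Assemble a query string from keywords and optional special syntaxes."""
    parts = [keywords]
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    if site:
        parts.append(f"site:{site}")
    parts += [f"-site:{s}" for s in exclude_sites]
    parts += [f"-inurl:{w}" for w in exclude_inurl]
    return " ".join(parts)

cars = build_query("cars", intitle="encyclopedia of",
                   exclude_sites=["amazon.com"],
                   exclude_inurl=["book", "products"])
botany = build_query("botany", intitle="directory of", site="edu")
print(cars)
print(botany)
```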

If mixing syntaxes doesn't find the resources you want, there are some clever keyword combinations that might just do the trick.

Finding Searchable Subject Indexes with Google

There are a few major searchable subject indexes and myriad minor ones that deal with a particular topic or idea. You can find the smaller subject indexes by customizing a few generic searches. "what's new" "what's cool" directory, while gleaning a few false results, is a great way to find searchable subject indexes.

directory "gossamer threads" new is an interesting one. Gossamer Threads is the creator of a popular link directory program. This is a good way to find searchable subject indexes without too many false hits.

directory "what's new" categories cool doesn't work particularly well, because the word "directory" is not a very reliable search term, but you will pull in some things with this query that you might otherwise have missed.

Let's put a few of these into practice:

"what's new" "what's cool" directory phylum
"what's new" "what's cool" directory carburetor
"what's new" "what's cool" directory "investigative journalism"
"what's new" directory categories gardening
directory "gossamer threads" new sailboats
directory "what's new" categories cool "basset hounds"

The real trick is to use a more general word, but make it unique enough that it applies mostly to your topic and not to many other topics.

Take acupuncture, for instance. Start narrowing it down by topic. What kind of acupuncture? For people or animals? If for people, what kinds of conditions are being treated? If for animals, what kinds of animals? Maybe you should search for "cat acupuncture", or maybe you should search for acupuncture arthritis. If this first round doesn't narrow the search results enough, keep going. Are you looking for education or treatment? You can skew results one way or the other using the site: syntax. So maybe you want "cat acupuncture" site:com or arthritis acupuncture site:edu. By taking just a few steps to narrow things down, you can get a reasonable number of search results focused around your topic.

Saturday, September 30, 2006

Look Up Definitions

Do you find yourself smiling knowingly when your boss mentions that well-known business principle you've never heard of? Overwhelmed with "geek speak"? Chances are Google's heard it mentioned, and possibly even defined, somewhere before.

Most specialized vocabularies remain, for the most part, fairly static; words don't suddenly change their meaning all that often. Not so with technical and computer-related jargon. It seems like every 12 seconds someone comes up with a new buzzword or term relating to computers or the Internet, and then 12 minutes later it becomes obsolete or means something completely different, often more than one thing at a time. Maybe it's not that bad. It just feels that way.

Google can help you in two ways: by helping you look up words and by helping you figure out what words you don't know but need to know.

Google Definitions

Before you assume you're going to be in for a lot of Googling, try asking Google for the definition directly. Simply prepend the term you're after with the special syntax keyword define, like so:

define google juice
define julienne
define 42

Google tells you that these are defined as "power of a website to turn up in Google," "cut food into thin sticks," and "being two more than forty," thanks to Wikipedia, Low Carb Luxury, and WordNet at Princeton, respectively.

Click the associated "Definition in context" link to visit the page from which the definition was drawn.

Click the "Web definitions for..." link or prefix the word you're defining with define: (note the addition of a colon) in the first place, and you'll net a full page of definitions drawn from all manner of places. For instance, define:TLA turns up oodles of definitions (all about the same, mind you).

If all that didn't turn up anything useful, move on to Google Web Search proper.

Slang

We have distinctive speech patterns that are shaped by our educations, our families, and our location. Further, we may use another set of words based on our occupation. When a teenager says something is "phat," that's slang: a specialized vocabulary used by a particular group. When a copywriter scribbles "stet" on an ad, that's not slang, but it's still specialized vocabulary or jargon used by a certain group (in this case, the advertising industry).

Being aware of these specialty words can make all the difference when it comes to searching. Adding specialized words to your search query, whether slang or industry jargon, can really change the slant of your search results.

Slang gives you one more way to break up your search engine results into geographically distinct areas. There's some geographical blurriness when you use slang to narrow your search engine results, but it's amazing how well it works. For example, search Google for football. Now search for football bloke. Totally different result sets, aren't they? Search for football bloke bonce. Now you're into soccer narratives.

Of course, this is not to say that everyone in England automatically uses the word "bloke" any more than everyone in the southern U.S. automatically uses the word "y'all." But adding well-chosen bits of slang (which will take some experimentation) gives your search results a whole different tenor and may point you in unexpected directions. You can find slang from the following resources:

The Probert Encyclopedia - Slang (http://www.probertencyclopaedia.com/slang.htm): This site is browseable by first letter or searchable by keyword. (Note that the keyword search covers the entire Probert Encyclopedia; slang results are near the bottom.) The slang presented here is from all over the world. It's often cross-linked, especially drug slang. As with most slang dictionaries, this site contains material that might offend.
A Dictionary of Slang (http://www.peevish.co.uk/slang/): This site focuses on slang heard in the United Kingdom, which means slang from other places as well. It's browseable by letter or via a search engine. Words from outside the UK are marked with their place of origin in brackets. Definitions also indicate typical usage: humorous, vulgar, derogatory, etc.
Surfing for Slang (http://www.spraakservice.net/slangportal): Of course, each area in the world has its own slang. This site has a good metalist of English and Scandinavian slang resources.
Urban Dictionary (http://www.urbandictionary.com): You can browse this collaborative dictionary by word and find dozens or hundreds of definitions for each word. The definitions are added by site visitors, and each definition is open to votes from other visitors. The most widely accepted definitions for each word bubble up to the top.

Start by searching Google for your query without the slang. Check the results and decide where they're falling short. Are they not specific enough? Are they not located in the right geographical area? Are they not covering the right demographic (teenagers, for example)?

Introduce one slang word at a time. For example, in a search for football, add the word bonce and check the results. If they're not narrow enough, add the word bloke. Add one word at a time until you get the results you want. Using slang is an inexact science, so you have to do some experimenting.
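
The one-word-at-a-time approach can be sketched in a few lines of Python. This is only an illustration: the function name progressive_queries is made up for this example, and the script just builds the query strings for you to paste into Google; it doesn't run any searches.

```python
def progressive_queries(base, extra_terms):
    """Return a list of queries, each adding one more term to the base query."""
    queries = [base]
    current = base
    for term in extra_terms:
        # Append exactly one new word per step so each result set can be compared
        current = f"{current} {term}"
        queries.append(current)
    return queries


# Build the football example from the text, one slang word at a time
for query in progressive_queries("football", ["bonce", "bloke"]):
    print(query)
```

Check the results for each query in order and stop at the first one that is narrow enough.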

Here are some things to be careful of when using slang in your searches:

Try many different slang words.
Don't use slang words that are generally considered offensive, except as a last resort. Your results will be skewed.
Be careful when using teenage slang, which changes constantly.
Try searching for slang when using Google Groups. Slang crops up often in conversation.
Minimize your searches for slang when searching for more formal sources, such as newspaper stories.
Avoid slang phrases if you can help it; in my experience, they change too much to be consistently searchable. Stick to established words.

Industrial Slang

Specialized vocabularies are those used in particular subject areas and industries. The medical and legal fields are good examples, although there are many others.

When you need to tip your search to the more technical, the more specialized, and the more in-depth, think of a specialized vocabulary. For example, do a Google search for heartburn. Now do a search for heartburn GERD. Now do a search for heartburn GERD gastric acid. You'll see that each is very different.

With some fields, finding specialized-vocabulary resources is a snap. But with others, it's not that easy. As a jumping-off point, try the Glossarist site at http://www.glossarist.com, which is a searchable subject index of about 6,000 different glossaries covering dozens of different topics. There are also several other large online resources covering certain specialized vocabularies. These resources include:

The On-Line Medical Dictionary (http://cancerweb.ncl.ac.uk/omd/)

This dictionary contains vocabulary relating to biochemistry, cell biology, chemistry, medicine, molecular biology, physics, plant biology, radiobiology, and other sciences and technologies. It currently has over 46,000 listings.

You can browse the dictionary by letter or search it by word. Sometimes you can search for a word that you know (bruise) and find another term that might be more common in medical terminology (contusion). You can also browse the dictionary by subject. Bear in mind that this dictionary is in the UK, and some spellings may be slightly different for American users (e.g., "tumour" versus "tumor").

MedTerms.com (http://www.medterms.com)

MedTerms.com has far fewer definitions (around 15,000), but it also has extensive articles from MedicineNet. If you're starting from absolute square one with your research and need some basic information and vocabulary to get started, search MedicineNet for your term (bruise works well) and then move to MedTerms.com to search for specific words.

Law.com's legal dictionary (http://dictionary.law.com/lookup2.asp)

Law.com's legal dictionary is excellent because you can search either words or definitions; you can browse, too. For example, you can search definitions for the word inheritance and get a list of all the entries that contain the word "inheritance." This is an easy way to get to the words "muniment of title" without knowing the path.

As with slang, add specialized vocabulary slowly, one word at a time, and anticipate that your search results will be narrowed very quickly. For example, take the word "spudding," often used in association with oil drilling. Searching for spudding by itself finds about 33,900 results on Google. Adding Texas knocks it down to 852 results, and this is still a very general search! Add specialized vocabulary very carefully, or you'll narrow your search results to the point where you can't find what you want.

Researching Terminology with Google

First things first: for heaven's sake, please don't just plug the abbreviation into the query box! For example, searching for XSLT will net you over 29 million results. While combing through the sites that Google turns up may eventually lead you to a definition, there's simply more to life than that. Instead, add "stands +for" to the query if it's an abbreviation or acronym. "XSLT stands +for" returns around 199,000 results, and the first is a tutorial glossary. If you're still getting too many results ("XML stands +for" gives you around six million results), try adding beginners or newbie to the query. "XML stands +for" beginners brings in 463 results, the fourth being a general, gentle "Introduction to XML."

If you're still not getting the results you want, try "What is X?" or "X +is short +for" or "X beginners FAQ", where X is the acronym or term. These should be regarded as second-tier methods, because most sites don't tend to use phrases such as "What is X?" on their pages, "X is short for" is uncommon language usage, and X might be so new (or so obscure) that it doesn't yet have a FAQ entry. Then again, your mileage may vary, and it's worth a shot; there's a lot of terminology out there.
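
If you look up acronyms often, the query patterns above are easy to generate. The following is a rough sketch, not an official tool: terminology_queries is a hypothetical helper name, and treating "X beginners FAQ" as loose keywords rather than an exact phrase is my assumption. It only builds the query strings, in the order suggested above.

```python
def terminology_queries(term, audience_words=("beginners", "newbie")):
    """Build definition-hunting queries for a term, most promising first."""
    stands_for = f'"{term} stands +for"'
    queries = [stands_for]
    # Narrow the "stands for" query with beginner-oriented words
    queries += [f"{stands_for} {word}" for word in audience_words]
    # Second-tier patterns, worth a shot when the first ones fail
    queries += [
        f'"What is {term}?"',
        f'"{term} +is short +for"',
        f"{term} beginners FAQ",  # assumption: unquoted keywords
    ]
    return queries


for query in terminology_queries("XSLT"):
    print(query)
```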

If you have hardware- or software-specific, as opposed to hardware- or software-related, terminology, try the word or phrase along with anything you might know about its usage. For example, as a Perl module, DynaLoader is software-specific terminology. That much known, simply give the two words a spin:

DynaLoader Perl

If the results are too advanced, assuming you already know what a DynaLoader is, start playing with the words beginners, newbie, and the like to bring you closer to information for beginners:

DynaLoader Perl Beginners

If you still can't find the word in Google, there are a few possible causes: perhaps it's slang specific to your area, your coworkers are playing with your mind, you heard it wrong (or there's a typo on the printout you got), or it's very, very new.

Where to Go When It's Not on Google

Despite your best efforts, you're not finding good explanations of the terminology on Google. There are a few other sites that might have what you're looking for:

Whatis (http://whatis.techtarget.com): A searchable subject index of computer terminology, from software to telecom. This is especially useful if you have a hardware- or software-specific word because the definitions are divided into categories. You can also browse alphabetically. Annotations are good and are often cross-indexed.
Webopedia (http://www.pcwebopaedia.com): Searchable by keyword or browsable by category. This site also has a list of the newest entries on the front page so that you can check for new words.
Netlingo (http://www.netlingo.com): This site is more Internet-oriented. It shows up with a frame on the left that contains the words, with the definitions on the right. It includes lots of cross-referencing and really old slang.
Tech Encyclopedia (http://www.techweb.com/encyclopedia/): Features definitions and information for over 20,000 words. The top 10 terms searched for are listed so you can see if everyone else is as confused as you are. Though each entry is flanked by lists of the terms that come before and after it, I saw only moderate cross-referencing.
Wikipedia (http://www.wikipedia.com): This public encyclopedia that anyone can edit is surprisingly accurate and up to date with technology slang. Because new entries don't need to be approved by one or two editors, and because the work of editing is done by thousands of volunteers across disciplines and industries, Wikipedia is constantly evolving with the times.

Geek terminology proliferates almost as quickly as web pages. Don't worry too much about deliberately keeping up; it's just about impossible. Instead, use Google as a "ready reference" resource for definitions.

Google Phonebook: Let Google's Fingers Do the Walking

Google makes an excellent phonebook, even to the extent of doing reverse lookups.

Google combines residential and business phone number information and its own excellent interface to offer a phonebook lookup that provides listings for businesses and residences in the United States. However, the lookup has its quirks: there are three different syntaxes, the amount of information you supply changes the results you get, the syntaxes are finicky, and Google doesn't provide documentation.

The Three Syntaxes

Google offers three ways to search its phonebook:

phonebook: Searches the entire Google phonebook
rphonebook: Searches residential listings only
bphonebook: Searches business listings only

Using the Syntaxes

Using a standard phonebook requires knowing quite a bit of information about what you're looking for: first name, last name, city, and state. Google's phonebook requires no more than last name and state to get started. Casting a wide net for all the Smiths in California is as simple as:

phonebook:smith ca

Adding a first name and a city narrows the results:

phonebook:john smith los angeles ca

At the time of this writing, the Google phonebook found 2 business and 20 residential listings for John Smith in Los Angeles, California.
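
Here is a small sketch of how the three syntaxes fit together with a name, an optional city, and a state. The helper phonebook_query is hypothetical, and lowercasing the search terms is my assumption, made only to match the examples in this section. The function assembles a query string; it doesn't contact Google.

```python
PHONEBOOK_OPERATORS = ("phonebook", "rphonebook", "bphonebook")


def phonebook_query(name, state, city="", operator="phonebook"):
    """Assemble a phonebook query from a name, an optional city, and a state."""
    if operator not in PHONEBOOK_OPERATORS:
        raise ValueError(f"unknown phonebook syntax: {operator!r}")
    # Order follows the examples in the text: name, then city, then state
    parts = [name.strip(), city.strip(), state.strip()]
    terms = " ".join(part for part in parts if part).lower()
    return f"{operator}:{terms}"


print(phonebook_query("smith", "ca"))
print(phonebook_query("John Smith", "CA", city="Los Angeles"))
print(phonebook_query("john smith", "ca", operator="rphonebook"))
```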

Caveats

The phonebook syntaxes are powerful and useful, but they can be difficult to use if you don't remember a few things about how they work.

Syntaxes are case-sensitive

Searching for phonebook:john doe ca works, while Phonebook:john doe ca (notice the capital P) doesn't.

Wildcards don't work

Then again, they're not needed, since the Google phonebook does all the wildcarding for you. For example, if you want to find shops in New York with "Coffee" in the name, don't bother trying to envision every permutation of "Coffee Shop," "Coffee House," and so on. Just search for bphonebook:coffee new york ny and you'll get a list of all businesses in New York whose names contain the word "coffee."

Exclusions don't work

Perhaps you want to find coffee shops that aren't Starbucks. You might think phonebook:coffee -starbucks new york ny would do the trick. After all, you're searching for coffee and not Starbucks, right? Unfortunately not; Google thinks you're looking for both the words "coffee" and "starbucks," yielding just the opposite of what you were hoping for: everything Starbucks in NYC.

OR doesn't always work

You might be wondering if Google's phonebook accepts OR lookups. You then might experiment, trying to find all the coffee shops in Rhode Island or Hawaii: bphonebook:coffee (ri | hi). Unfortunately, that doesn't work; the only listings you'll get are for coffee shops in Hawaii. This is because Google doesn't see the (ri | hi) as a state code, but rather as another element of the search.

So, if you reverse the previous search and search for coffee (hi | ri), Google would find listings that contain the word "coffee" and either of the strings "hi" or "ri." This means you'll find Hi-Tide Coffee (in Massachusetts) and several coffee shops in Rhode Island.

It's neater to use OR in the middle of your query and specify a state at the end. For example, if you want to find coffee shops that sell either donuts or bagels, this query works fine: bphonebook:coffee (donuts | bagels) ma. It finds stores in Massachusetts whose names contain the word "coffee" and either the word "donuts" or the word "bagels." The bottom line: you can use an OR query on the store or resident name, but not on the location.
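
The bottom line can be expressed as a tiny query builder that allows alternatives in the name but insists on a single state. This is a sketch under assumptions: business_or_query is a made-up name, and it assumes the alternatives are joined with Google's usual | (OR) operator inside parentheses.

```python
def business_or_query(keyword, alternatives, state):
    """Build a business lookup with OR'd name words and exactly one state."""
    if len(state.split()) != 1:
        # OR on the location doesn't work, so insist on a single state code
        raise ValueError("specify exactly one state code")
    group = " | ".join(alternatives)  # assumption: | is the OR operator
    return f"bphonebook:{keyword} ({group}) {state.lower()}"


print(business_or_query("coffee", ["donuts", "bagels"], "ma"))
```

To cover two states, build one query per state and run them separately.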

Reverse Phonebook Lookup

All three phonebook syntaxes support reverse lookup, though it's probably best to use the general phonebook: syntax to avoid not finding what you're looking for due to a residential or business classification.

To do a reverse search, just enter the phone number with area code. Lookups without area code won't work:

phonebook:(707) 827-7000
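
Since lookups without an area code won't work, it can help to check a number before building the query. The sketch below is illustrative only: reverse_lookup_query is a hypothetical helper, and it assumes a 10-digit U.S. number formatted the same way as the example above.

```python
import re


def reverse_lookup_query(phone):
    """Turn a U.S. phone number into a reverse-lookup phonebook query."""
    digits = re.sub(r"\D", "", phone)  # keep digits only
    if len(digits) != 10:
        raise ValueError("a 10-digit number including area code is required")
    area, prefix, line = digits[:3], digits[3:6], digits[6:]
    return f"phonebook:({area}) {prefix}-{line}"


print(reverse_lookup_query("707.827.7000"))
```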