[ IMPORTANT --> Be sure to read the section on requirements! ]

     This is a very ALPHA-TEST implementation of a thesaurus for GNU
Emacs.  Although it is not complete, I'm not sure when or if I'll have
the time to spiff it up.  As a result, I'm posting what I have here (is
anyone else working on something similar?).  It's copyrighted and is
being released under the GNU General Public License (see the end of
this file for more details).  Note that only this interface falls under
the GNU General Public License; the thesaurus itself has a completely
separate and independent "copyright".

     The Emacs-Lisp functions in this package allow you to query a
thesaurus for synonyms of a word.  For example, you can ask Emacs to
quickly display a thesaurus entry for "editor":

-------------------------------------------------------------------------------
***** Word: editor

     #593. Book. -- N. booklet; writing, work, volume, tome, opuscule;
tract, tractate; livret; brochure, libretto, handbook, codex, manual,
pamphjlet, enchiridion, circular, publication; chap book.
     part, issue, number livraison; album, portfolio; periodical, serial,
magazine, ephemeris, annual, journal.
     paper, bill, sheet, broadsheet; leaf, leaflet; fly leaf, page; quire,
ream
     chapter, section head, article paragraph, passage, clause.
     folio, quarto, octavo; duodecimo, sextodecimo, octodecimo.
     encyclopedia; encompilation;  library, bibliotheca; press &c.
(publication) 531.
     writer, author, litterateur, essayist, journalism; pen, scribbler, the
scribbling race; literary hack, Grub-street writer; writerr for the press,
gentleman of the press, representative of the press; adjective jerker,
diaskeaus, ghost, hack writer, ink slinger; publicist; reporter, penny a
liner; editor, subeditor; playwright &c. 599; powt &c. 597.
     bookseller, publisher; bibliopole, bibliopolist; librarian; bookstore,
bookseller's shop.
     knowledge of books, bibliography; book learning &c. (knowledge) 490.
     Phr. "among the giant fossils of my past" [E. B. Browning]; craignez
tout d'un auteur en courroux; "for authors nobler palms remain" [Pope]; "I
lived to write and wrote to live" [Rogers]; "look in thy heart and write"
[Sidney]; "there is no Past so long as Books shall live" [Bulwer Lytton);
"the public mind is the creation of the Master-Writers" [Disraeli]; volumes
that I prize above my dukedom" [Tempest].
-------------------------------------------------------------------------------


*******************************************************************************
***** REQUIREMENTS:

     To use this, you need the following (besides the files that came
with this README file):

* A copy of the thesaurus itself (which is not included with this README
  file).  Thanks to Project Gutenberg, a copy of the 1911 Roget's
  Thesaurus has been made available via anonymous ftp from
  mrcnext.cso.uiuc.edu [ 128.174.201.12 ] (please ftp the file during
  off-hours -- at times OTHER THAN 10:00 AM to 6:00 PM Central Standard
  Time (Daylight in summer)).  It's in the directory "/etext":

	-rw-r--r--  1 24       micro    1377400 Jun 19 18:08 roget11.txt
	-rw-r--r--  1 24       micro     592247 Jun 19 18:13 roget11.zip

  You only need one of these, as roget11.zip is just roget11.txt packed
  into a .ZIP file.  Note, however, the difference in size: the .ZIP
  file is less than half as large.

* A copy of Perl 4.0, compiled with dbm/ndbm support, as the thesaurus
  indexing and low-level access routines are written as Perl scripts
  (this was done to avoid having to load the entire 1.3MB thesaurus into
  Emacs, bloating its process size).  Part of the index is stored as a
  dbm database, and so dbm/ndbm support must be compiled into Perl.

* While building the index (an index must be built from the raw
  thesaurus data), it is recommended that your system have plenty of
  free RAM and swap space, as a single 10-12 megabyte process is created
  during indexing.  Once the index is created, far fewer resources are
  needed to access the thesaurus.

* You need about two megabytes of free disk space.  The thesaurus
  occupies about 1.3MB, and the index files occupy another half
  megabyte or so.


     Installation instructions are given below.


*******************************************************************************
***** USAGE:

     The GNU Emacs interface provides three functions:

	thesaurus-lookup-word
	     This function will prompt for a word to look up, and all entries
	     that begin with this word will be displayed.  To display only
	     the entry that contains exactly this word, specify a prefix
	     argument (e.g., C-u).
	
	thesaurus-lookup-word-in-text
	     This function will extract the word under the cursor and run
	     `thesaurus-lookup-word' upon it.  A prefix argument can be
	     specified to force the display of only the entry that contains
	     exactly this word.
	
	thesaurus-show-words
	     This function will prompt for a word and will display all words
	     in the thesaurus that begin with this word.

These functions should be bound to some key sequences; however, this
package does not do this.  You'll have to do it yourself.

     There is also a shell-command-line interface to the thesaurus
(which is what the GNU Emacs interface uses).  Using the "th" Perl
script, you can query the thesaurus for a number of things:

	th <word> [<word> ...]
		Search the thesaurus for all entries that begin with
		"<word>".  Multiple words can be specified here.

	th -V <word> [<word> ...]
		Search the thesaurus for all entries that begin with
		"<word>".  All displayed entries are separated by a line
		of dashes.

	th -W <word> [<word> ...]
		Search the thesaurus for the entry that contains
		"<word>" exactly.

	th -w <word> [<word> ...]
		Display all words in the thesaurus that begin with
		"<word>".

	th -w -v <word> [<word> ...]
		Display all words in the thesaurus that begin with
		"<word>".  Alongside each word, the numbers of the
		entries that contain the word are displayed.

	th -n <number>
		Display thesaurus entry number "<number>".  Unlike
		words, only one number can be specified at a time.

Generally, you will want to pipe the output to more(1) or less(1).
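
     As a small convenience, the paging can be wrapped in a shell
function.  The following is only a sketch: the name "thp" is made up for
this example and is not part of this package, and it assumes that "th"
is already on your $PATH.

```shell
# thp: hypothetical helper (NOT part of this package) that runs "th"
# with the given arguments and pages the output.
# Uses $PAGER if it is set, and falls back to more(1) otherwise.
thp() {
    th "$@" | ${PAGER:-more}
}
```

With this defined in your shell startup file, "thp -W editor" would show
the exact entry for "editor" one screenful at a time.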



*******************************************************************************
***** PROBLEMS:

     Error handling needs work.  Nothing is output if a word is not
found in the thesaurus.
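
     Until this is fixed, the silent case can be worked around from the
shell.  The following is only a sketch: "th_report" is a made-up name
and is not part of this package, and it relies solely on the behavior
described above (no output at all when nothing is found).  Note that,
if several words are given, it only complains when none of them
produces any output.

```shell
# th_report: hypothetical wrapper (NOT part of this package).
# Runs "th" with the given arguments; if nothing at all is printed,
# writes a note to standard error and returns a non-zero status.
th_report() {
    th_out=$(th "$@")
    if [ -z "$th_out" ]; then
        echo "th: nothing found for: $*" >&2
        return 1
    fi
    printf '%s\n' "$th_out"
}
```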

     The scripts are simple-minded, and occasionally "screw up"
(fortunately, this seems to be rare).

     There are typos in the thesaurus, which can cause the scripts to
mis-index very small parts of the thesaurus.

     The scripts used to build the indices are inefficient and are
unbelievably poorly written.  Fortunately, this doesn't really matter,
as the index creation process is a one-time task.  Looking up words
in the thesaurus is quite fast.

     The thesaurus is stored in an uncompressed form.  I thought about
breaking the thesaurus apart and storing each entry as a separate,
compressed, file, but this method loses some information in the
thesaurus (which cannot currently be accessed by these routines).  It
might be interesting to try compiling Perl with GNU dbm and storing the
entire thesaurus as a monolithic gdbm database (one entry per datum).


*******************************************************************************
***** INSTALLATION INSTRUCTIONS:

     The following assumes that you are familiar with Emacs and that you
have installed a copy of Perl 4.0.

1. Create a directory to hold all of the files.

2. Copy the files that came with this README file into that directory.

3. Copy the thesaurus into that directory.

4. cd to that directory.

5. Link the thesaurus to the name "roget.txt".  For example, if the
   copied thesaurus is called "roget11.txt", you can use a symbolic link
   (if your system supports symbolic links):

	ln -s roget11.txt roget.txt

   or you can use a hard link:

	ln roget11.txt roget.txt

6. Run the script "makeindex".  This script runs the other scripts to
   build the index files.  On most modern machines with adequate
   resources, it'll take about 5-10 minutes to run (less, on fast
   machines, and more, on slow machines).  Note that a single 10-12
   megabyte process is created during the indexing procedures, so be
   sure that you have plenty of free RAM (otherwise, you'll go into swap
   h*ll, and this procedure could take hours).  Once everything is done,
   three files will be created:

	-rw-r--r--   1 root     other       4096 Dec 17 20:50 offsets.dir
	-rw-r--r--   1 root     other      16384 Dec 17 20:50 offsets.pag
	-rw-r--r--   1 root     other     492733 Dec 17 20:49 word-index

   (The file sizes may be different on your machine.)

   The files that begin with "offsets" make up a dbm/ndbm database that
   maps entry numbers to byte offsets in the thesaurus file (i.e., for a
   given entry number, the byte offset at which that entry begins is
   stored).  This means that, if the thesaurus file is ever edited or
   changed, you MUST re-execute the "makeindex" script to rebuild the
   indices.

   The file "word-index" is an ASCII text file of words and entry
   numbers, e.g.:

	creditor: 805
	creed: 484
	creek: 198 343 348

   In this example, the word "creditor" is mentioned in entry #805,
   "creed" is mentioned in entry #484, and "creek" is mentioned in
   entries #198, #343, and #348.

7. Edit the file "th", and find the line (around line 69):

	$thesaurus_dir = "/usr/local/lib/roget";

   Change "/usr/local/lib/roget" to point to the directory containing
   the thesaurus and index files.

8. Add this directory to your $PATH.  If you don't, you won't be able to
   run the "th" command (and the Emacs interface, which runs "th", won't
   work either).

9. Edit your .emacs file and add this directory to your load-path.  Also
   add a line like the following:

	(load-library "thesaurus")

10. That's it.
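
     The way the index files from step 6 fit together can be
illustrated with a toy example.  This is only a sketch, and is not the
code used by "th": the sample text, the temporary directory, the
function names, and the plain-text file "offsets.txt" (standing in for
the dbm/ndbm database, which cannot be read directly from the shell) are
all made up for this example.  Each toy entry is a single line; real
entries span many lines.

```shell
#!/bin/sh
# Toy illustration of the two-stage lookup:
#   word -> entry numbers -> byte offset -> entry text.
# All data below is invented sample data, not the real thesaurus.
LC_ALL=C; export LC_ALL              # make awk count bytes, not characters
demo=${TMPDIR:-/tmp}/th-demo.$$
mkdir -p "$demo"
trap 'rm -rf "$demo"' EXIT           # clean up when the shell exits

# A tiny stand-in for the thesaurus: each entry starts with "#<number>.".
printf '%s\n' '#1. First. -- N. alpha, beta.' \
              '#2. Second. -- N. beta, gamma.' > "$demo/roget.txt"

# A word index in the same "word: numbers" format as "word-index".
printf '%s\n' 'alpha: 1' 'beta: 1 2' 'gamma: 2' > "$demo/word-index"

# Stand-in for the offsets database: "<entry number> <byte offset>".
# The offset is the number of bytes that precede the entry in the file.
awk 'BEGIN { pos = 0 }
     { if (match($0, /^#[0-9]+/)) print substr($0, 2, RLENGTH - 1), pos
       pos += length($0) + 1 }' "$demo/roget.txt" > "$demo/offsets.txt"

# entries_for WORD: print the entry numbers listed for WORD.
entries_for() {
    awk -F': ' -v w="$1" '$1 == w { print $2 }' "$demo/word-index"
}

# show_entry NUMBER: look up the byte offset, then read from there.
# "tail -c +K" starts at byte K counting from 1, hence the "+ 1".
show_entry() {
    off=$(awk -v n="$1" '$1 == n { print $2 }' "$demo/offsets.txt")
    [ -n "$off" ] || return 1
    tail -c "+$((off + 1))" "$demo/roget.txt" | head -n 1
}

# Print every toy entry that mentions "beta".
for n in $(entries_for beta); do
    show_entry "$n"
done
```

Because the stored offsets are plain byte counts, inserting or deleting
even one character early in the thesaurus file shifts every later entry
away from its recorded offset, which is why the index must be rebuilt
after any change.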


*******************************************************************************
***** Legal foo:

-------------------------------------------------------------------------------
These thesaurus indexing and accessing routines are copyrighted.
Copyright (C) 1991 Darryl Okahata (darrylo@sr.hp.com)

NOTE THAT THE THESAURUS ITSELF HAS A COMPLETELY SEPARATE AND INDEPENDENT
"COPYRIGHT".  SEE THE THESAURUS FOR DETAILS.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 1, or (at your option)
any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-------------------------------------------------------------------------------

     -- Darryl Okahata
	Internet: darrylo@sr.hp.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion or policy of Hewlett-Packard or of the
little green men that have been following him all day.