jWeb1T - java Web 1T 5-gram Searcher

User's Guide

Introduction

jWeb1T is an open source Java tool for efficiently searching the Web 1T 5-gram corpus. It is based on a binary search algorithm that finds the n-grams in a specific file and returns their frequency counts in logarithmic time. As the corpus is stored in many files, a simple index is used to retrieve the files containing the n-grams that start with a specific prefix. The corpus must be installed and uncompressed on a hard drive (approx. 90 GB).
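The lookup idea described above can be illustrated with a minimal sketch. This is not the jWeb1T implementation: the class and method names are hypothetical, and the sketch assumes that each line of an n-gram file has the form "n-gram<TAB>count", that lines are sorted in byte order, and that queries are plain ASCII.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Illustrative sketch only; names and file format are assumptions, not the jWeb1T code. */
class SortedFileLookup {

    /** Returns the first complete line that starts at byte offset >= pos, or null at end of file. */
    static String lineAtOrAfter(RandomAccessFile file, long pos) throws IOException {
        if (pos == 0) {
            file.seek(0);
        } else {
            file.seek(pos - 1);
            file.readLine(); // consume the rest of the line that contains byte pos - 1
        }
        return file.readLine();
    }

    /** The n-gram part of a line, i.e. everything before the tab. */
    static String keyOf(String line) {
        int tab = line.indexOf('\t');
        return tab < 0 ? line : line.substring(0, tab);
    }

    /** Binary search over byte offsets; returns the count, or 0 if the n-gram is absent. */
    static long getFreq(RandomAccessFile file, String ngram) throws IOException {
        long lo = 0;
        long hi = file.length();
        // Find the smallest offset whose following line is >= ngram (or does not exist).
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            String line = lineAtOrAfter(file, mid);
            if (line == null || keyOf(line).compareTo(ngram) >= 0) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        String line = lineAtOrAfter(file, lo);
        if (line == null || !keyOf(line).equals(ngram)) {
            return 0;
        }
        int tab = line.indexOf('\t');
        return tab < 0 ? 0 : Long.parseLong(line.substring(tab + 1).trim());
    }
}
```

Each iteration halves the byte range and reads at most two lines, so the number of disk seeks grows with the logarithm of the file size.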

jWeb1T has been used in the FBK-irst system for the English Lexical Substitution Task at SemEval 2007 [1].

License

jWeb1T is released as free software with full source code, provided under the terms of the Apache License, Version 2.0.

System Requirements

The jWeb1T software is available on all platforms supporting Java 2.

Installation

  1. Make sure you have installed Sun's Java 2 Environment. The full version is available at http://java.sun.com/j2se. If you are limited by disk space or bandwidth, install the smaller, run-time only version at http://java.sun.com/j2se.

  2. Download the jar file of the installer here. Copy the file into the directory where you want to install the program and then run:

    jar -xvf jweb1t-1.10.jar

Dependencies

jWeb1T uses elements of the Java 2 API such as collections, and therefore building requires the Java 2 Standard Edition SDK (Software Development Kit). To run jWeb1T, the Java 2 Standard Edition RTE (Run Time Environment) is required (or you can use the SDK, of course).

jWeb1T is also dependent upon a few packages for general functionality. They are included in the lib directory for convenience, but the default build target does not include them. If you use the default build target, you must add the dependencies to your classpath.

Using a C shell, run:

setenv CLASSPATH dist/web1t.jar
setenv CLASSPATH ${CLASSPATH}:lib/log4j-1.2.8.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-digester.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-logging.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-collections.jar
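A quick way to confirm that the classpath is complete is to try loading one class from each dependency. The following is a small sketch, not part of jWeb1T: the ClasspathCheck class is hypothetical, and the class names chosen as probes for each jar are assumptions based on the usual contents of those libraries.

```java
/** Illustrative sketch only; reports whether some expected classes can be loaded. */
class ClasspathCheck {

    /** Returns true if the named class can be found on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // 'false' avoids running static initializers; we only test visibility.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One probe class per jar (assumed names, except FullSearch, which this guide uses).
        String[] probes = {
            "org.fbk.irst.tcc.web1t.FullSearch",
            "org.apache.log4j.Logger",
            "org.apache.commons.digester.Digester",
            "org.apache.commons.logging.LogFactory",
            "org.apache.commons.collections.ArrayStack"
        };
        for (String name : probes) {
            System.out.println((isLoadable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Any line reported as MISSING suggests that the corresponding jar is not yet on the classpath.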

Running jWeb1T

The Web 1T 5-gram corpus must be installed and uncompressed on a hard drive (approx. 60 GB). The n-gram files must be stored in different directories, one for 1-grams, one for 2-grams and so on. For example, this is the list for the 2-grams:

shell>ls -laFh 2gms
total 5.0G
drwxr-xr-x    2 giuliano  tcc          4.0K Dec 18  2006 ./
dr-xr-xr-x    5 giuliano  tcc          4.0K Aug 23  2006 ../
-r--r--r--    1 giuliano  tcc          131M Aug 22  2006 2gm-0000
-r--r--r--    1 giuliano  tcc          143M Aug 22  2006 2gm-0001
-r--r--r--    1 giuliano  tcc          148M Aug 22  2006 2gm-0002
-r--r--r--    1 giuliano  tcc          148M Aug 22  2006 2gm-0003

...

-r--r--r--    1 giuliano  tcc          179M Aug 22  2006 2gm-0027
-r--r--r--    1 giuliano  tcc          176M Aug 22  2006 2gm-0028
-r--r--r--    1 giuliano  tcc          172M Aug 22  2006 2gm-0029
-r--r--r--    1 giuliano  tcc          170M Aug 22  2006 2gm-0030
-r--r--r--    1 giuliano  tcc           71M Aug 22  2006 2gm-0031
-r--r--r--    1 giuliano  tcc           771 Aug 22  2006 2gm.idx

Note that the n-grams are stored on several DVDs. Make sure that all files are uncompressed and copied into the destination directories. The 1-grams are stored in one file (vocab).

Indexing is called with the following parameters:

java org.fbk.irst.tcc.web1t.CreateFileMap n-gram-dir index-file

Arguments:

n-gram-dir    directory containing the n-gram files
index-file    file in which to store the resulting index

For example, the following command creates the index for the 2-gram files stored in the directory 2gms and stores the result in index-2gms:

java org.fbk.irst.tcc.web1t.CreateFileMap 2gms/ index-2gms
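To give an idea of what such an index makes possible, here is a minimal sketch of a file map. It is not the format written by CreateFileMap, which this guide does not describe: the FileMapSketch class is hypothetical, and the sketch assumes that the files are named like 2gm-0000, that each line has the form "n-gram<TAB>count", and that n-grams are sorted across files. It records the first n-gram of every file and then picks the file whose first n-gram is the greatest one not exceeding the query.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; not the index format used by jWeb1T. */
class FileMapSketch {

    /** Maps the first n-gram of each n-gram file in the directory to that file's name. */
    static TreeMap<String, String> build(Path ngramDir) throws IOException {
        TreeMap<String, String> index = new TreeMap<>();
        // The glob skips other files in the directory, such as 2gm.idx.
        try (DirectoryStream<Path> files =
                 Files.newDirectoryStream(ngramDir, "*gm-[0-9][0-9][0-9][0-9]")) {
            for (Path file : files) {
                try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                    String first = in.readLine();
                    if (first != null) {
                        int tab = first.indexOf('\t');
                        String key = tab < 0 ? first : first.substring(0, tab);
                        index.put(key, file.getFileName().toString());
                    }
                }
            }
        }
        return index;
    }

    /** Returns the only file that can contain the n-gram, or null if it sorts before all files. */
    static String fileFor(TreeMap<String, String> index, String ngram) {
        Map.Entry<String, String> entry = index.floorEntry(ngram);
        return entry == null ? null : entry.getValue();
    }
}
```

With such a map, a query first selects a single file and then only that file needs to be searched.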

The index must be built for all n-gram orders (1 to 5). The index files must be specified to jWeb1T through the config file web1t-config.xml. An example config file is as follows:

<web1t-config>
  <index-list>
    <index>
      <ngram>1</ngram>
      <file-name>index-1gms</file-name>
    </index>
    <index>
      <ngram>2</ngram>
      <file-name>index-2gms</file-name>
    </index>
    <index>
      <ngram>3</ngram>
      <file-name>index-3gms</file-name>
    </index>
    <index>
      <ngram>4</ngram>
      <file-name>index-4gms</file-name>
    </index>
    <index>
      <ngram>5</ngram>
      <file-name>index-5gms</file-name>
    </index>
  </index-list>
</web1t-config>
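
The structure of the config file can be made concrete with a short sketch that reads it into a map from n-gram order to index file. jWeb1T itself depends on commons-digester, so this is not how the tool reads its configuration: the ConfigSketch class and its methods are hypothetical and use only the XML parser shipped with the JDK.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Illustrative sketch only; not the configuration loader used by jWeb1T. */
class ConfigSketch {

    /** Parses the XML text and returns a map from n-gram order (1..5) to index file name. */
    static Map<Integer, String> readIndexList(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<Integer, String> indexByOrder = new TreeMap<>();
        NodeList entries = doc.getElementsByTagName("index");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String order = entry.getElementsByTagName("ngram").item(0).getTextContent().trim();
            String file = entry.getElementsByTagName("file-name").item(0).getTextContent().trim();
            indexByOrder.put(Integer.parseInt(order), file);
        }
        return indexByOrder;
    }

    /** Number of whitespace-separated tokens in a non-empty query, i.e. its n-gram order. */
    static int orderOf(String ngram) {
        return ngram.trim().split("\\s+").length;
    }
}
```

For a query such as "After installing the software", orderOf returns 4, so the entry for 4-grams (index-4gms in the example above) would be the one to consult.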

Searching is called with the following parameters:

java org.fbk.irst.tcc.web1t.FullSearch (n-gram)+

Arguments:

n-gram    a list of n-grams

For example, the following command searches for the 4-gram "After installing the software" using the command line:

java org.fbk.irst.tcc.web1t.FullSearch "After installing the software"

[main] INFO  org.fbk.irst.tcc.web1t.FullSearch  - f('After installing the software') = 4564, found in 139 ms

The following code fragment shows a search example using the API:

import org.fbk.irst.tcc.web1t.FullSearch;
import org.apache.log4j.Logger;
import java.io.*;


...

class Example {

...

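public void search(String t) throws Exception {
  // logger is assumed to be a log4j Logger field declared in the elided part of the class
  FullSearch search = new FullSearch();
  long begin = System.currentTimeMillis();  // start of timing
  long f = search.getFreq(t);               // frequency count of the n-gram t
  long end = System.currentTimeMillis();    // end of timing
  logger.info("f('" + t + "') = " + f + ", found in " + (end - begin) + " ms");
} // end search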

...

} // end class Example
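
The same pattern extends to a list of n-grams, mirroring the (n-gram)+ argument of the command line. The sketch below is self-contained so that it runs without the corpus: BatchSearchSketch and the FreqSource interface are hypothetical stand-ins, and the toy in-memory source only reuses the count shown in the sample output above. With jWeb1T on the classpath, a FullSearch object could presumably be wrapped in place of the toy source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; runs against a toy source instead of the real corpus. */
class BatchSearchSketch {

    /** Hypothetical stand-in for any object offering getFreq(String), such as FullSearch. */
    interface FreqSource {
        long getFreq(String ngram) throws Exception;
    }

    /** Looks up every n-gram in order, printing the count and the elapsed time for each. */
    static Map<String, Long> searchAll(FreqSource source, String... ngrams) throws Exception {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String ngram : ngrams) {
            long begin = System.currentTimeMillis();
            long f = source.getFreq(ngram);
            long end = System.currentTimeMillis();
            System.out.println("f('" + ngram + "') = " + f + ", found in " + (end - begin) + " ms");
            counts.put(ngram, f);
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // Toy data; unknown n-grams are reported with a count of 0.
        Map<String, Long> toy = new LinkedHashMap<>();
        toy.put("After installing the software", 4564L);
        searchAll(ngram -> toy.getOrDefault(ngram, 0L),
                  "After installing the software", "no such n-gram");
    }
}
```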

How to cite our work

Please cite the following document:

Claudio Giuliano. jWeb1T: a library for searching the Web 1T 5-gram corpus, 2007. Software available at http://tcc.itc.it/research/textec/tools-resources/jweb1t.html.

The BibTeX entry is as follows:

@Manual{Giuliano:2007fk,
 author = {Claudio Giuliano},
 title = {j{W}eb1{T}: a library for searching the {W}eb 1{T} 5-gram corpus},
 year =	{2007},
 note =	{Software available at \url{http://tcc.itc.it/research/textec/tools-resources/jweb1t.html}}
}

Bibliography

[1] Claudio Giuliano, Alfio Gliozzo and Carlo Strapparava. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), Prague, 23-24 June 2007.

Acknowledgments

jWeb1T has been funded by the X-Media project.