jWeb1T - Java Web 1T 5-gram Searcher
User's Guide
jWeb1T is an open source Java tool for efficiently searching the Web 1T 5-gram corpus. It uses a binary search algorithm that locates an n-gram in a specific file and returns its frequency count in time logarithmic in the size of that file. Because the corpus is stored in many files, a simple index is used to retrieve the files containing the n-grams that start with a given prefix. The corpus must be installed and uncompressed on a hard drive (approx. 90 GB).
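The lookup described above can be sketched as follows. This is a minimal illustration and not jWeb1T's actual code: the class and method names (SortedNgramFile, lookup, lineStartAtOrAfter) are hypothetical. It assumes that each line has the form n-gram<TAB>count, that the file contains no blank lines, that lines are sorted in the order used by String.compareTo, and that the text is plain ASCII.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical sketch: binary search over a sorted "n-gram<TAB>count" file.
// Not jWeb1T's implementation; file layout and sort order are assumptions.
class SortedNgramFile {

    // Returns the offset of the first line that starts at or after pos.
    static long lineStartAtOrAfter(RandomAccessFile file, long pos) throws IOException {
        if (pos == 0) {
            return 0;
        }
        file.seek(pos - 1);
        file.readLine(); // consume up to and including the next line break
        return file.getFilePointer();
    }

    // Returns the count stored for ngram, or -1 if the file does not contain it.
    static long lookup(RandomAccessFile file, String ngram) throws IOException {
        long lo = 0;
        long hi = file.length();
        // Invariant: if the n-gram is present, its line starts in [lo, hi).
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            long start = lineStartAtOrAfter(file, mid);
            if (start >= hi) {
                hi = mid; // no line starts in [mid, hi)
                continue;
            }
            file.seek(start);
            String line = file.readLine();
            int tab = line.lastIndexOf('\t');
            int cmp = line.substring(0, tab).compareTo(ngram);
            if (cmp == 0) {
                return Long.parseLong(line.substring(tab + 1).trim());
            } else if (cmp < 0) {
                lo = file.getFilePointer(); // target starts after this line
            } else {
                hi = mid; // target starts before this line
            }
        }
        return -1;
    }
}
```

Each iteration shrinks the byte range [lo, hi), so the number of seeks grows with the logarithm of the file size. RandomAccessFile.readLine is unbuffered, which keeps the sketch short but would be slow for heavy use.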
jWeb1T has been used in the FBK-irst system for the English Lexical Substitution Task at SemEval 2007 [1].
jWeb1T is released as free software with full source code, provided under the terms of the Apache License, Version 2.0.
The jWeb1T software is available on all platforms supporting Java 2.
Make sure you have installed Sun's Java 2 Environment. The full version is available at http://java.sun.com/j2se. If you are limited by disk space or bandwidth, install the smaller, run-time only version at http://java.sun.com/j2se.
Download the jar file of the installer here. Copy the file into the directory where you want to install the program and then run:
jar -xvf jweb1t-1.10.jar
jWeb1T uses elements of the Java 2 API such as collections, and therefore building requires the Java 2 Standard Edition SDK (Software Development Kit). To run jWeb1T, the Java 2 Standard Edition RTE (Run Time Environment) is required (or you can use the SDK, of course).
jWeb1T is also dependent upon a few packages for general functionality. They
are included in the lib directory for convenience, but the default build
target does not include them. If you use the default build target,
you must add the dependencies to your classpath.
Using a C shell, run:
setenv CLASSPATH dist/web1t.jar
setenv CLASSPATH ${CLASSPATH}:lib/log4j-1.2.8.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-digester.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-logging.jar
setenv CLASSPATH ${CLASSPATH}:lib/commons-collections.jar
The Web 1T 5-gram corpus must be installed and uncompressed on a hard drive (approx. 60 GB). The n-gram files must be stored in separate directories, one for 1-grams, one for 2-grams, and so on. For example, this is the listing for the 2-grams:
shell>ls -laFh 2gms
total 5.0G
drwxr-xr-x 2 giuliano tcc 4.0K Dec 18 2006 ./
dr-xr-xr-x 5 giuliano tcc 4.0K Aug 23 2006 ../
-r--r--r-- 1 giuliano tcc 131M Aug 22 2006 2gm-0000
-r--r--r-- 1 giuliano tcc 143M Aug 22 2006 2gm-0001
-r--r--r-- 1 giuliano tcc 148M Aug 22 2006 2gm-0002
-r--r--r-- 1 giuliano tcc 148M Aug 22 2006 2gm-0003
...
-r--r--r-- 1 giuliano tcc 179M Aug 22 2006 2gm-0027
-r--r--r-- 1 giuliano tcc 176M Aug 22 2006 2gm-0028
-r--r--r-- 1 giuliano tcc 172M Aug 22 2006 2gm-0029
-r--r--r-- 1 giuliano tcc 170M Aug 22 2006 2gm-0030
-r--r--r-- 1 giuliano tcc 71M Aug 22 2006 2gm-0031
-r--r--r-- 1 giuliano tcc 771 Aug 22 2006 2gm.idx
Note that the n-grams are distributed on several DVDs. Make sure that all files are uncompressed and copied into the destination directories. The 1-grams are stored in one file (vocab).
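A quick way to confirm that a directory holds no leftover compressed files is sketched below. The helper is hypothetical and not part of jWeb1T; it assumes Java 8 or later and that compressed corpus files carry a .gz suffix.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of jWeb1T: lists files that still look compressed.
class CorpusCheck {

    // Returns the sorted names of regular files in dir that end in ".gz"
    // (the suffix is an assumption about how compressed files are named).
    static List<String> compressedFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .map(p -> p.getFileName().toString())
                .filter(name -> name.endsWith(".gz"))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

An empty result for each n-gram directory suggests that every file has been uncompressed.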
Indexing is called with the following parameters:
java org.fbk.irst.tcc.web1t.CreateFileMap n-gram-dir index-file
Arguments:
n-gram-dir → directory containing the n-gram files
index-file → file in which to store the resulting index
For example, the following command is used to create the index for 2-grams files stored in the directory 2gms and the result is stored in index-2gms:
java org.fbk.irst.tcc.web1t.CreateFileMap 2gms/ index-2gms
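To illustrate the idea behind such an index, here is a minimal sketch. It does not reproduce the file format written by CreateFileMap, which this guide does not describe. The class FileIndex and its methods are hypothetical; the sketch assumes the index records the first n-gram of every file and that n-grams are ordered as by String.compareTo.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a file index; not the format produced by CreateFileMap.
class FileIndex {

    // Maps the first n-gram of each file to that file's name.
    private final TreeMap<String, String> firstNgramToFile = new TreeMap<>();

    void add(String firstNgram, String fileName) {
        firstNgramToFile.put(firstNgram, fileName);
    }

    // Returns the file that could contain ngram, or null if it sorts before every file.
    String fileFor(String ngram) {
        Map.Entry<String, String> entry = firstNgramToFile.floorEntry(ngram);
        return entry == null ? null : entry.getValue();
    }
}
```

With one entry per file the index stays tiny, and floorEntry selects the last file whose first n-gram is not greater than the query. A prefix query could span several consecutive files, which this single-file sketch does not cover.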
An index must be built for each n-gram order (1 to 5). The index files must be specified to jWeb1T through the config file web1t-config.xml.
An example config file is as follows:
<web1t-config>
<index-list>
<index>
<ngram>1</ngram>
<file-name>index-1gms</file-name>
</index>
<index>
<ngram>2</ngram>
<file-name>index-2gms</file-name>
</index>
<index>
<ngram>3</ngram>
<file-name>index-3gms</file-name>
</index>
<index>
<ngram>4</ngram>
<file-name>index-4gms</file-name>
</index>
<index>
<ngram>5</ngram>
<file-name>index-5gms</file-name>
</index>
</index-list>
</web1t-config>
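The mapping expressed by this config can be read with the JDK's built-in XML parser, as in the sketch below. It is an illustration only: jWeb1T's own loader may differ, and ConfigSketch, readIndexFiles and order are hypothetical names. The sketch assumes the file is well-formed XML (the root element is closed with </web1t-config>) and that the order of a query is its number of whitespace-separated tokens.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the config layout shown above; jWeb1T's own loader may differ.
class ConfigSketch {

    // Returns a map from n-gram order to the index file name.
    static Map<Integer, String> readIndexFiles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<Integer, String> result = new TreeMap<>();
        NodeList entries = doc.getElementsByTagName("index");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            String order = entry.getElementsByTagName("ngram").item(0).getTextContent().trim();
            String file = entry.getElementsByTagName("file-name").item(0).getTextContent().trim();
            result.put(Integer.parseInt(order), file);
        }
        return result;
    }

    // Counts whitespace-separated tokens to find the order of a query (an assumption).
    static int order(String ngram) {
        return ngram.trim().split("\\s+").length;
    }
}
```

Under these assumptions, readIndexFiles(xml).get(order(query)) gives the index file to consult for a query.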
Searching is called with the following parameters:
java org.fbk.irst.tcc.web1t.FullSearch (n-gram)+
Arguments:
n-gram → a list of n-grams
For example, the following command is used to search the 4-gram "After installing the software" using the command line:
java org.fbk.irst.tcc.web1t.FullSearch "After installing the software"
[main] INFO org.fbk.irst.tcc.web1t.FullSearch - f('After installing the software') = 4564, found in 139 ms
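When the command-line tool is driven from a script, the frequency can be pulled out of an output line like the one above. The sketch below is a hypothetical helper, not part of jWeb1T, and it assumes the message text keeps exactly the format shown; the prefix before the message depends on the log4j configuration and is ignored.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: extracts the frequency from a FullSearch log line.
class LogLine {

    // Matches "f('<n-gram>') = <count>, found in <millis> ms" anywhere in the line.
    private static final Pattern RESULT =
        Pattern.compile("f\\('(.*)'\\) = (\\d+), found in (\\d+) ms");

    // Returns the reported frequency, or -1 if the line does not match.
    static long frequency(String line) {
        Matcher m = RESULT.matcher(line);
        return m.find() ? Long.parseLong(m.group(2)) : -1;
    }
}
```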
The following code fragment shows a search example using the API:
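import org.fbk.irst.tcc.web1t.FullSearch;
import org.apache.log4j.Logger;
import java.io.*;
...
class Example {
    // log4j logger used to report the result
    private static final Logger logger = Logger.getLogger(Example.class);
    ...
    public void search(String t) throws Exception {
        FullSearch search = new FullSearch();
        // time the lookup so the elapsed milliseconds can be reported
        long begin = System.currentTimeMillis();
        long f = search.getFreq(t);
        long end = System.currentTimeMillis();
        logger.info("f('" + t + "') = " + f + ", found in " + (end - begin) + " ms");
    } // end search
    ...
} // end class Example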
Please cite the following document:
Claudio Giuliano. jWeb1T: a library for searching the Web 1T 5-gram corpus, 2007. Software available at http://tcc.itc.it/research/textec/tools-resources/jweb1t.html.
The BibTeX entry is as follows:
@Manual{Giuliano:2007fk,
author = {Claudio Giuliano},
title = {j{W}eb1{T}: a library for searching the {W}eb 1{T} 5-gram corpus},
year = {2007},
note = {Software available at \url{http://tcc.itc.it/research/textec/tools-resources/jweb1t.html}}
}
[1] Claudio Giuliano, Alfio Gliozzo and Carlo Strapparava. FBK-irst: Lexical Substitution Task Exploiting Domain and Syntagmatic Coherence. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), Prague, 23-24 June 2007.
jWeb1T has been funded by the X-Media Project.