Workshop: Week 2
================

PRELIM 1. Subversion setup.
===========================

Please make sure that you have Subversion working by the end of this
workshop, and talk to me if you have any problems.

Please remember: DO NOT ADD DATA FILES TO SUBVERSION!  I'll end up
with a huge repository if this happens.  Have a look at the
documentation for your svn client on how to ignore files by pattern.
On my Linux system, the setting "global-ignores" in the file
"~/.subversion/config" does this.  The setting:

    global-ignores = *.dat *.db

will prevent Subversion from adding any files with the extension
".dat" or ".db".

PRELIM 2. Code.
================

Code for this week's workshop can be downloaded from:

    http://www.williamwebber.com/research/teaching/comp90042/2014s1/wksh/wk02/code.tar.gz

This contains the following files:

    coll.py   -- collection wrapper (from last week)
    mkind.py  -- example code for building a simple index
    qryind.py -- example code for querying an index

You may also want to reuse some of the code you wrote for the last
worksheet (or else the answers I provided for that worksheet, on the
subject page).

PRELIM 3. Db implementations.
==============================

A "db" here is a persistent (that is, on-disk) key:value database.
Standard db implementations include bdb, dbm, and gdbm.  Python
provides an abstract "anydbm" interface, which works with whichever db
implementation you have installed on your system (bdb, dbm, gdbm).  If
you have no such implementation, "anydbm" falls back on a simple (and
much slower) native Python implementation, called "dumbdbm".  Your
performance will therefore vary depending on whether you have a db
implementation installed, and you may want to figure out how to
install one on your system.  (A short check for which implementation
you are getting is sketched at the end of these preliminaries.)

PRELIM 4. Shelve.
==================

We will be using the Python "shelve" module for building indexes in
this workshop; this module sits on top of "anydbm".  A "shelve" object
has an interface very similar to a Python dictionary, except that
objects are saved in persistent format to a file.  See mkind.py and
qryind.py for examples of usage.

However, it can be very slow to modify shelved objects.  When building
document or term vectors, build the full vector first, then add it to
the shelf; don't create an empty vector, add it to the shelf, then add
elements to the vector.  That is, do something like:

    shelf = shelve.open(filename)
    for doc in coll:
        docvec = {}
        for (term, freq) in doc:
            docvec[term] = freq
        shelf[doc.get_docid()] = docvec
    shelf.close()

Do NOT do:

    shelf = shelve.open(filename)
    for doc in coll:
        shelf[doc.get_docid()] = {}
        for (term, freq) in doc:
            shelf[doc.get_docid()][term] = freq   # ***
    shelf.close()

because the line marked "***" will be constantly loading, modifying,
and storing data to and from disk.

PRELIM 5. Data.
================

We'll continue using the LYRL dataset from last week.  Again, DON'T
add this dataset, or any indexes built upon it, to Subversion.

If you want to see the actual documents underlying a document ID, I've
set up a web service for this.  If the document ID is '46825', and
your Subversion user id is 'foo', then you can fetch the document
from:

    http://www.williamwebber.com/rcv1v2/?docid=46825&user=foo

Note that each user can only use this service a maximum of 1,000
times; requesting the same document multiple times counts each time.
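As mentioned in PRELIM 3, which db implementation "anydbm" (and hence
"shelve") picks up affects how fast your indexes build.  Here is a
minimal sketch of how you might check what you are getting; the
filename "test.db" is just an arbitrary example:

    import anydbm
    import whichdb

    # Create (or open) a small test db, write an entry, and close it.
    db = anydbm.open("test.db", "c")
    db["hello"] = "world"
    db.close()

    # whichdb reports which db module the file was written with;
    # "dumbdbm" means no real db implementation was found.
    print whichdb.whichdb("test.db")

If this prints "dumbdbm", your indexes will be slow to build, and you
may want to install one of the db packages mentioned above.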
=====================================================================

Question 1: Indexes
===================

To date, we've been re-parsing, re-scoring, and re-normalizing the
entire collection every time we run a query.  This is obviously
wasteful.  It is better to do the parsing and processing once, store
the results to an index, and then query the index for the data we
need.

In the provided code, "mkind.py" gives an example of creating a simple
index, and "qryind.py" of querying it.  Read these programs to
understand what they do, then run them to make sure they work.  On my
Linux system, they might be run like so:

    python mkind.py ../dat/lyrl_tokens_30k.dat ind.db
    python qryind.py ind.db 26151 26152

Extend your (or my) solution to last worksheet's Question 4 ("Length
normalization") to build an index of length-normalized document
vectors.  Then reimplement last worksheet's Question 5 ("Similarity
computation") based upon this index.  Compare the timings with the
index to the timings with the original program.  (On Linux, this can
be done with the "time" command:

    time python pairsim.py ind.db 26151 26152

Or else see:

    http://stackoverflow.com/questions/1557571/how-to-get-time-of-a-python-program-execution

for how to do the timing directly in Python.)

Question 2: Query on doc:docvec index
=====================================

Write code that takes a query (either from the command line, or from
standard input; whatever you prefer), and then evaluates it against
the doc:docvec index built in Question 1.  To do this, create a
pseudo-document vector for the query.  For instance, if the query
were:

    jaguar car race

then the pseudo-document vector would be:

    { "jaguar": 1, "car": 1, "race": 1 }

Then iterate over the documents in the index, calculating the cosine
similarity between each of them and the query pseudo-vector.  Finally,
return the document that is most similar to the query (or, if you
prefer, the top ten documents, in decreasing similarity).  (A minimal
sketch is given after Question 4 below.)

NOTE: we're eliding a subtle point here.  In order to run the query
against the index, we need to stem the query terms in the same way
that the index is stemmed.  For now, you'll have to manually figure
out what the correct stemming is, and enter queries with the stemmed
terms.  ("jaguar", "car", and "race" are all the same stemmed and
unstemmed.)

What document has the highest similarity with the query "jaguar car
race"?  What is the similarity score?  And how long does it take your
program to run?

Question 3: Build an inverted index
===================================

Write code that builds an inverted index.  The idea of the inversion
can be expressed in Python-esque pseudo code:

    inv_idx = {}
    for (docid, docvec) in coll:
        for (term, weight) in docvec:
            inv_idx[term][docid] = weight

That is, we go from an outer dictionary with docid keys and an inner
one with term keys, to an outer dictionary with term keys and an inner
one with docid keys.  Make sure you perform the TF*IDF and
document-length normalizations before adding the term vectors to the
inverted index.  Implement the inverted index so that the index itself
is stored in a shelve object.  (A runnable sketch is given after
Question 4 below.)

Build such an index for the LYRL 30k dataset.  What is the size of the
index?  How large is it compared to the original data set?

Question 4: Querying on an inverted index
=========================================

Redo Question 2, but this time run the query against the inverted
index.  (Congratulations; you've just written a search engine!)

Do you get the same result and score?  (You should.)  How long does
the query take to run this time, compared to the implementation in
Question 2?
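For Question 2, a minimal sketch of evaluating a query against the
doc:docvec index might look like the following.  The index file name,
the command-line interface, and the assumption that each stored vector
is a length-normalized {term: weight} dictionary are mine, not part of
the provided code:

    import shelve
    import sys
    import time

    def cosine(qvec, dvec):
        # Both vectors are {term: weight} dicts.  If the stored document
        # vectors are already length-normalized, only the query vector's
        # length needs to be divided out.
        dot = sum(w * dvec.get(t, 0.0) for (t, w) in qvec.items())
        qlen = sum(w * w for w in qvec.values()) ** 0.5
        return dot / qlen if qlen > 0 else 0.0

    if __name__ == "__main__":
        index_file = sys.argv[1]          # the index built in Question 1
        qvec = {}
        for term in sys.argv[2:]:         # query terms, already stemmed
            qvec[term] = qvec.get(term, 0) + 1

        start = time.time()
        shelf = shelve.open(index_file, flag="r")
        scores = [(cosine(qvec, dvec), docid)
                  for (docid, dvec) in shelf.items()]
        shelf.close()
        for (score, docid) in sorted(scores, reverse=True)[:10]:
            print "%s\t%.4f" % (docid, score)
        print "elapsed: %.2f seconds" % (time.time() - start)

The time.time() calls also give you the in-Python timing asked for in
Questions 1 and 2.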
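For Question 3, the inversion pseudocode above can be turned into a
runnable sketch along the following lines.  The docvecs argument, a
stand-in for your own TF*IDF-scored and length-normalized document
vectors, and the index_file name are assumptions.  Following the
advice in PRELIM 4, the postings are built completely in memory and
each posting dict is written to the shelf in a single assignment:

    import shelve

    def build_inverted_index(docvecs, index_file):
        # docvecs: an iterable of (docid, {term: weight}) pairs, where the
        # weights are already TF*IDF-scored and document-length normalized.
        inv_idx = {}
        for (docid, docvec) in docvecs:
            for (term, weight) in docvec.items():
                # setdefault creates the posting dict the first time a
                # term is seen.
                inv_idx.setdefault(term, {})[docid] = weight
        # Write each completed posting dict to the shelf in one go,
        # rather than modifying shelved entries in place (see PRELIM 4).
        shelf = shelve.open(index_file, flag="n")
        for (term, postings) in inv_idx.items():
            shelf[term] = postings
        shelf.close()

For the 30k-document LYRL subset the in-memory inv_idx should fit
comfortably; for a much larger collection you would need to write
postings out incrementally instead.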
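For Question 4, a common approach is term-at-a-time scoring over the
posting lists of the query terms.  A sketch, again assuming that the
weights in the inverted index are already length-normalized and that
the query is given on the command line (all names are my own):

    import shelve
    import sys

    def query_inverted_index(index_file, query_terms):
        # Build the query pseudo-document vector.
        qvec = {}
        for term in query_terms:
            qvec[term] = qvec.get(term, 0) + 1
        qlen = sum(w * w for w in qvec.values()) ** 0.5

        # Term-at-a-time scoring: only the posting lists of the query
        # terms are ever read from the index.
        shelf = shelve.open(index_file, flag="r")
        scores = {}
        for (term, qweight) in qvec.items():
            for (docid, dweight) in shelf.get(term, {}).items():
                scores[docid] = scores.get(docid, 0.0) + qweight * dweight
        shelf.close()

        # Document vectors are assumed to be length-normalized already,
        # so only the query length is divided out.
        return sorted(((score / qlen, docid)
                       for (docid, score) in scores.items()), reverse=True)

    if __name__ == "__main__":
        results = query_inverted_index(sys.argv[1], sys.argv[2:])
        for (score, docid) in results[:10]:
            print "%s\t%.4f" % (docid, score)

Because only the query terms' postings are fetched, this should be
noticeably faster than the full scan of the doc:docvec index in
Question 2.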
Question 5: Term similarity
===========================

In the lecture, we discussed calculating term similarity using the
TDM, as the distance between terms in document space.  This similarity
computation can be implemented on top of an inverted index (in the
same way that a document similarity computation is implemented on a
straightforward doc:docvec index).

Using the inverted index you created in Question 3, with the same
$w_{d,t}$ scores you calculated for it (that is, document-length
normalized TF*IDF), find the top 15 suggestions (most similar terms)
for the terms "socc", "jaguar", and "najibullah".  (A sketch is given
at the end of this worksheet.)  Compare these with the term
similarities calculated in the lecture (one of which used plain term
frequency as weights, the other of which used term-length
normalization).  Which similarity lists do you prefer?  Does
document-length normalization make any sense here?  What about IDF?
Does it lead to any incongruous results in the "similar" terms?

Question 6: Pseudo-relevance feedback
=====================================

Implement the Rocchio algorithm described at the end of the fourth
lecture.  Take the top 5 results as feedback documents.  Note that you
will need an inverted index to do the initial query efficiently, but
then a doc:docvec index to extract the document vectors used to
enhance the query vector.  The enhanced query vector can then be
evaluated by comparing it for similarity with each document vector in
the doc:docvec index.

What is the computational complexity of this last operation?  Can you
think of a way to speed it up?
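For Question 5, each term's posting dict in the inverted index can be
treated as a vector over documents, and term-term similarity computed
as the cosine between such vectors.  A minimal sketch; the index file
name is an assumption, and since the stored weights are not
term-length normalized, both term vector lengths are divided out here:

    import shelve

    def term_cosine(vec1, vec2):
        # Posting dicts ({docid: weight}) treated as vectors over documents.
        dot = sum(w * vec2.get(d, 0.0) for (d, w) in vec1.items())
        len1 = sum(w * w for w in vec1.values()) ** 0.5
        len2 = sum(w * w for w in vec2.values()) ** 0.5
        if len1 == 0 or len2 == 0:
            return 0.0
        return dot / (len1 * len2)

    def similar_terms(index_file, term, k=15):
        shelf = shelve.open(index_file, flag="r")
        target = shelf[term]
        scores = [(term_cosine(target, postings), other)
                  for (other, postings) in shelf.items() if other != term]
        shelf.close()
        return sorted(scores, reverse=True)[:k]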
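For Question 6, one common formulation of Rocchio query expansion for
pseudo-relevance feedback is sketched below.  The parameter values
alpha and beta, the averaging over feedback documents, and the
omission of the negative-feedback term are my assumptions and may not
match the formulation given in the lecture; feedback_docids would be
the document IDs of the top 5 results of the initial query (Question
4), and docvec_file the doc:docvec index from Question 1:

    import shelve

    def rocchio_expand(qvec, docvec_file, feedback_docids,
                       alpha=1.0, beta=0.75):
        # Expanded query = alpha * original query vector
        #                + beta * (mean of the feedback document vectors).
        num_fb = float(len(feedback_docids))
        expanded = dict((term, alpha * weight)
                        for (term, weight) in qvec.items())
        shelf = shelve.open(docvec_file, flag="r")
        for docid in feedback_docids:
            for (term, weight) in shelf[docid].items():
                expanded[term] = (expanded.get(term, 0.0)
                                  + beta * weight / num_fb)
        shelf.close()
        return expanded

The expanded query vector can then be evaluated against the doc:docvec
index in the same way as the original query in Question 2.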