Workshop: Week 2
================

PRELIM 1. Subversion setup.
===========================

Please make sure that you have Subversion working by the end of this
workshop, and talk to me if you have any problems.

Please remember: DO NOT ADD DATA FILES TO SUBVERSION!  I'll end up
with a huge repository if this happens.  Have a look at the
documentation for your svn client on how to ignore files by pattern.
On my Linux system, the setting "global-ignores" in the file
"~/.subversion/config" does this.  The setting:

    global-ignores = *.dat *.db

will prevent Subversion from adding any files with the extension
".dat" or ".db".

PRELIM 2. Code.
================

Code for this week's workshop can be downloaded from:

    http://www.williamwebber.com/research/teaching/comp90042/2014s1/wksh/wk02/code.tar.gz

This contains the following files:

    coll.py   -- collection wrapper (from last week)
    mkind.py  -- example code for building a simple index
    qryind.py -- example code for querying an index

You may also want to reuse some of the code you wrote for the last
worksheet (or else the answers I provided for that worksheet, on the
subject page).

PRELIM 3. Db implementations.
==============================

A "db" here is a persistent (that is, on-disk) key:value database.
Standard db implementations include bdb, dbm, and gdbm.  Python
provides an abstract "anydbm" interface, which works with whichever db
implementation you have installed on your system (bdb, dbm, gdbm).  If
you have no such implementation, "anydbm" falls back on a simple (and
much slower) native Python implementation, called "dumbdbm".  Your
performance will therefore vary depending on whether you have a db
implementation installed, and you may want to figure out how to
install one on your system.  (A short check for which implementation
you are getting is sketched at the end of these preliminaries.)

PRELIM 4. Shelve.
==================

We will be using the Python "shelve" module for building indexes in
this workshop; this module sits on top of "anydbm".  A "shelve" object
has an interface very similar to a Python dictionary, except that
objects are saved in persistent format to a file.  See mkind.py and
qryind.py for examples of usage.

However, it can be very slow to modify shelved objects.  When building
document or term vectors, build the full vector first, then add it to
the shelf; don't create an empty vector, add it to the shelf, then add
elements to the vector.  That is, do something like:

    shelf = shelve.open(filename)
    for doc in coll:
        docvec = {}
        for (term, freq) in doc:
            docvec[term] = freq
        shelf[doc.get_docid()] = docvec
    shelf.close()

Do NOT do:

    shelf = shelve.open(filename)
    for doc in coll:
        shelf[doc.get_docid()] = {}
        for (term, freq) in doc:
            shelf[doc.get_docid()][term] = freq   # ***
    shelf.close()

because the line marked "***" will be constantly loading, modifying,
and storing data to and from disk.

PRELIM 5. Data.
================

We'll continue using the LYRL dataset from last week.  Again, DON'T
add this dataset, or any indexes built upon it, to Subversion.

If you want to see the actual documents underlying a document ID, I've
set up a web service for this.  If the document ID is '46825', and
your Subversion user id is 'foo', then you can fetch the document
from:

    http://www.williamwebber.com/rcv1v2/?docid=46825&user=foo

Note that each user can only use this service a maximum of 1,000
times; requesting the same document multiple times counts each time.
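As mentioned in PRELIM 3, which db implementation "anydbm" (and hence
"shelve") picks up affects how fast your indexes build.  Here is a
minimal sketch of how you might check what you are getting; the
filename "test.db" is just an arbitrary example:

    import anydbm
    import whichdb

    # Create (or open) a small test db, write an entry, and close it.
    db = anydbm.open("test.db", "c")
    db["hello"] = "world"
    db.close()

    # whichdb reports which db module the file was written with;
    # "dumbdbm" means no real db implementation was found.
    print whichdb.whichdb("test.db")

If this prints "dumbdbm", your indexes will be slow to build, and you
may want to install one of the db packages mentioned above.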
=====================================================================

Question 1: Indexes
===================

To date, we've been re-parsing, re-scoring, and re-normalizing the
entire collection every time we run a query.  This is obviously
wasteful.  It is better to do the parsing and processing once, store
the results to an index, and then query the index for the data we
need.

In the provided code, "mkind.py" gives an example of creating a simple
index, and "qryind.py" of querying it.  Read these programs to
understand what they do, then run them to make sure they work.  On my
Linux system, they might be run like so:

    python mkind.py ../dat/lyrl_tokens_30k.dat ind.db
    python qryind.py ind.db 26151 26152

Extend your (or my) solution to last worksheet's Question 4 ("Length
normalization") to build an index of length-normalized document
vectors.  Then reimplement last worksheet's Question 5 ("Similarity
computation") based upon this index.  Compare the timings with the
index to the timings with the original program.  (On Linux, this can
be done with the "time" command:

    time python pairsim.py ind.db 26151 26152

Or else see:

    http://stackoverflow.com/questions/1557571/how-to-get-time-of-a-python-program-execution

for how to do the timing directly in Python.)

Question 2: Query on doc:docvec index
=====================================

Write code that takes a query (either from the command line, or from
standard input; whatever you prefer), and then evaluates it against
the doc:docvec index built in Question 1.  To do this, create a
pseudo-document vector for the query.  For instance, if the query
were:

    jaguar car race

then the pseudo-document vector would be:

    { "jaguar": 1, "car": 1, "race": 1 }

Then iterate over the documents in the index, calculating the cosine
similarity between each of them and the query pseudo-vector.  Finally,
return the document that is most similar to the query (or, if you
prefer, the top ten documents, in decreasing similarity).  (A minimal
sketch is given after Question 4 below.)

NOTE: we're eliding a subtle point here.  In order to run the query
against the index, we need to stem the query terms in the same way
that the index is stemmed.  For now, you'll have to manually figure
out what the correct stemming is, and enter queries with the stemmed
terms.  ("jaguar", "car", and "race" are all the same stemmed and
unstemmed.)

What document has the highest similarity with the query "jaguar car
race"?  What is the similarity score?  And how long does it take your
program to run?

Question 3: Build an inverted index
===================================

Write code that builds an inverted index.  The idea of the inversion
can be expressed in Python-esque pseudo code:

    inv_idx = {}
    for (docid, docvec) in coll:
        for (term, weight) in docvec:
            inv_idx[term][docid] = weight

That is, we go from an outer dictionary with docid keys and an inner
one with term keys, to an outer dictionary with term keys and an inner
one with docid keys.  Make sure you perform the TF*IDF and
document-length normalizations before adding the term vectors to the
inverted index.  Implement the inverted index so that the index itself
is stored in a shelve object.  (A runnable sketch is given after
Question 4 below.)

Build such an index for the LYRL 30k dataset.  What is the size of the
index?  How large is it compared to the original data set?

Question 4: Querying on an inverted index
=========================================

Redo Question 2, but this time run the query against the inverted
index.  (Congratulations; you've just written a search engine!)

Do you get the same result and score?  (You should.)  How long does
the query take to run this time, compared to the implementation in
Question 2?
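For Question 2, a minimal sketch of evaluating a query against the
doc:docvec index might look like the following.  The index file name,
the command-line interface, and the assumption that each stored vector
is a length-normalized {term: weight} dictionary are mine, not part of
the provided code:

    import shelve
    import sys
    import time

    def cosine(qvec, dvec):
        # Both vectors are {term: weight} dicts.  If the stored document
        # vectors are already length-normalized, only the query vector's
        # length needs to be divided out.
        dot = sum(w * dvec.get(t, 0.0) for (t, w) in qvec.items())
        qlen = sum(w * w for w in qvec.values()) ** 0.5
        return dot / qlen if qlen > 0 else 0.0

    if __name__ == "__main__":
        index_file = sys.argv[1]          # the index built in Question 1
        qvec = {}
        for term in sys.argv[2:]:         # query terms, already stemmed
            qvec[term] = qvec.get(term, 0) + 1

        start = time.time()
        shelf = shelve.open(index_file, flag="r")
        scores = [(cosine(qvec, dvec), docid)
                  for (docid, dvec) in shelf.items()]
        shelf.close()
        for (score, docid) in sorted(scores, reverse=True)[:10]:
            print "%s\t%.4f" % (docid, score)
        print "elapsed: %.2f seconds" % (time.time() - start)

The time.time() calls also give you the in-Python timing asked for in
Questions 1 and 2.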
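For Question 3, the inversion pseudocode above can be turned into a
runnable sketch along the following lines.  The docvecs argument, a
stand-in for your own TF*IDF-scored and length-normalized document
vectors, and the index_file name are assumptions.  Following the
advice in PRELIM 4, the postings are built completely in memory and
each posting dict is written to the shelf in a single assignment:

    import shelve

    def build_inverted_index(docvecs, index_file):
        # docvecs: an iterable of (docid, {term: weight}) pairs, where the
        # weights are already TF*IDF-scored and document-length normalized.
        inv_idx = {}
        for (docid, docvec) in docvecs:
            for (term, weight) in docvec.items():
                # setdefault creates the posting dict the first time a
                # term is seen.
                inv_idx.setdefault(term, {})[docid] = weight
        # Write each completed posting dict to the shelf in one go,
        # rather than modifying shelved entries in place (see PRELIM 4).
        shelf = shelve.open(index_file, flag="n")
        for (term, postings) in inv_idx.items():
            shelf[term] = postings
        shelf.close()

For the 30k-document LYRL subset the in-memory inv_idx should fit
comfortably; for a much larger collection you would need to write
postings out incrementally instead.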
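For Question 4, a common approach is term-at-a-time scoring over the
posting lists of the query terms.  A sketch, again assuming that the
weights in the inverted index are already length-normalized and that
the query is given on the command line (all names are my own):

    import shelve
    import sys

    def query_inverted_index(index_file, query_terms):
        # Build the query pseudo-document vector.
        qvec = {}
        for term in query_terms:
            qvec[term] = qvec.get(term, 0) + 1
        qlen = sum(w * w for w in qvec.values()) ** 0.5

        # Term-at-a-time scoring: only the posting lists of the query
        # terms are ever read from the index.
        shelf = shelve.open(index_file, flag="r")
        scores = {}
        for (term, qweight) in qvec.items():
            for (docid, dweight) in shelf.get(term, {}).items():
                scores[docid] = scores.get(docid, 0.0) + qweight * dweight
        shelf.close()

        # Document vectors are assumed to be length-normalized already,
        # so only the query length is divided out.
        return sorted(((score / qlen, docid)
                       for (docid, score) in scores.items()), reverse=True)

    if __name__ == "__main__":
        results = query_inverted_index(sys.argv[1], sys.argv[2:])
        for (score, docid) in results[:10]:
            print "%s\t%.4f" % (docid, score)

Because only the query terms' postings are fetched, this should be
noticeably faster than the full scan of the doc:docvec index in
Question 2.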
Question 5: Term similarity
===========================

In the lecture, we discussed calculating term similarity using the
TDM, as the distance between terms in document space.  This similarity
computation can be implemented on top of an inverted index (in the
same way that a document similarity computation is implemented on a
straightforward doc:docvec index).

Using the inverted index you created in Question 3, with the same
$w_{d,t}$ scores you calculated for it (that is, document-length
normalized TF*IDF), find the top 15 suggestions (most similar terms)
for the terms "socc", "jaguar", and "najibullah".  (A sketch is given
at the end of this worksheet.)  Compare these with the term
similarities calculated in the lecture (one of which used plain term
frequency as weights, the other of which used term-length
normalization).  Which similarity lists do you prefer?  Does
document-length normalization make any sense here?  What about IDF?
Does it lead to any incongruous results in the "similar" terms?

Question 6: Pseudo-relevance feedback
=====================================

Implement the Rocchio algorithm described at the end of the fourth
lecture.  Take the top 5 results as feedback documents.  Note that you
will need an inverted index to do the initial query efficiently, but
then a doc:docvec index to extract the document vectors used to
enhance the query vector.  The enhanced query vector can then be
evaluated by comparing it for similarity with each document vector in
the doc:docvec index.

What is the computational complexity of this last operation?  Can you
think of a way to speed it up?
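For Question 5, each term's posting dict in the inverted index can be
treated as a vector over documents, and term-term similarity computed
as the cosine between such vectors.  A minimal sketch; the index file
name is an assumption, and since the stored weights are not
term-length normalized, both term vector lengths are divided out here:

    import shelve

    def term_cosine(vec1, vec2):
        # Posting dicts ({docid: weight}) treated as vectors over documents.
        dot = sum(w * vec2.get(d, 0.0) for (d, w) in vec1.items())
        len1 = sum(w * w for w in vec1.values()) ** 0.5
        len2 = sum(w * w for w in vec2.values()) ** 0.5
        if len1 == 0 or len2 == 0:
            return 0.0
        return dot / (len1 * len2)

    def similar_terms(index_file, term, k=15):
        shelf = shelve.open(index_file, flag="r")
        target = shelf[term]
        scores = [(term_cosine(target, postings), other)
                  for (other, postings) in shelf.items() if other != term]
        shelf.close()
        return sorted(scores, reverse=True)[:k]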
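For Question 6, one common formulation of Rocchio query expansion for
pseudo-relevance feedback is sketched below.  The parameter values
alpha and beta, the averaging over feedback documents, and the
omission of the negative-feedback term are my assumptions and may not
match the formulation given in the lecture; feedback_docids would be
the document IDs of the top 5 results of the initial query (Question
4), and docvec_file the doc:docvec index from Question 1:

    import shelve

    def rocchio_expand(qvec, docvec_file, feedback_docids,
                       alpha=1.0, beta=0.75):
        # Expanded query = alpha * original query vector
        #                + beta * (mean of the feedback document vectors).
        num_fb = float(len(feedback_docids))
        expanded = dict((term, alpha * weight)
                        for (term, weight) in qvec.items())
        shelf = shelve.open(docvec_file, flag="r")
        for docid in feedback_docids:
            for (term, weight) in shelf[docid].items():
                expanded[term] = (expanded.get(term, 0.0)
                                  + beta * weight / num_fb)
        shelf.close()
        return expanded

The expanded query vector can then be evaluated against the doc:docvec
index in the same way as the original query in Question 2.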