Workshop: Week 4
PRELIM 1. SciPy, NumPy.
=======================
We're starting to look at methods (e.g., singular value decomposition)
that are too complex to be worth implementing from scratch.
From now on, we'll make more use of existing software tools and
libraries to perform some of the basic analyses.
In Python, many advanced analytic tools build upon two Python
packages (or extensions): NumPy and SciPy (http://www.numpy.org
and http://scipy.org/scipylib).
NumPy is a package for numerical analysis. It provides an
efficient N-dimensional array object, and direct support
for matrix algebra operations upon it (for instance, matrix
addition or multiplication as single operators). The
array object (whether as a 1-d vector, a 2-d matrix, or a
3+-d tensor) is the fundamental data structure for many
Python scientific and analytic packages.
SciPy extends NumPy with several components useful in
scientific computing, such as statistical distributions,
sparse matrices, and clustering (it, for instance, provides
implementations of hierarchical clustering).
Install NumPy and SciPy on your system. If you are not
familiar with the packages, read through the NumPy tutorial
at:
http://wiki.scipy.org/Tentative_NumPy_Tutorial
and try at least some of the example code.
=====================================================================
Question 1: Using SciPy for Clustering
======================================
To get started with using SciPy, use the function
scipy.cluster.hierarchy.single() to redo Question 3 of
last week's worksheet. Use
scipy.cluster.hierarchy.dendrogram() to calculate the
dendrogram representation, and then pylab.savefig() to store
it.
You'll need to calculate the pairwise doc--doc similarities,
and place them in an upper-triangular matrix;
cluster.hierarchy.single() will then cluster them for you.
As before, be aware of the $O(n^2)$ time and (because of the
representation required by the function) space complexity
here. Start with a small number of documents (say, 100),
then progressively double until running time or space
requirements become uncomfortable.
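A minimal sketch of the calling pattern follows, with random placeholder
vectors standing in for your real document vectors. It uses no_plot=True
so the dendrogram structure is computed without being drawn; drop that
argument and call pylab.savefig() when you want the figure.

```python
import numpy as np
from scipy.cluster.hierarchy import single, dendrogram
from scipy.spatial.distance import pdist

# Placeholder data: replace with your real document vectors.
rng = np.random.RandomState(0)
docs = rng.rand(20, 8)

# pdist() computes the pairwise distances in the condensed
# (flattened upper-triangular) form that single() expects.
dists = pdist(docs, metric="cosine")

# Single-link agglomerative clustering: the linkage matrix Z has
# one row per merge, (n - 1) rows in all.
Z = single(dists)

# no_plot=True returns the dendrogram structure without drawing it.
tree = dendrogram(Z, no_plot=True)
print(Z.shape)
```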
Question 2: Singular value decomposition
========================================
Scipy provides singular value decomposition through the
scipy.linalg.svd function.
Consider the following TDM (taken from Manning and Schütze,
Chapter 18 of "Introduction to Information Retrieval"):
         d1  d2  d3  d4  d5  d6
ship      1   0   1   0   0   0
boat      0   1   0   0   0   0
ocean     1   1   0   0   0   0
voyage    1   0   0   1   1   0
trip      0   0   0   1   0   1
Convert this into a numpy matrix (using numpy.matrix()),
then use scipy.linalg.svd to perform the singular
value decomposition. Do this as:
import numpy as np
import scipy.linalg as la
X = np.matrix([[1, 0, 1, 0, 0, 0],
               [0, 1, 0, 0, 0, 0],
               [1, 1, 0, 0, 0, 0],
               [1, 0, 0, 1, 1, 0],
               [0, 0, 0, 1, 0, 1]])  # the TDM above
(U, s, V_T) = la.svd(X)
Print out U, s, V_T.
Note: due to imprecision in floating-point operations, you can get
extremely small values where there should be zeros.
To set these to zero for, say, the U matrix,
do:
U[np.absolute(U) < 0.000001] = 0.0
Note, in passing, how numpy allows you to do this
sort of operation as single array operation, rather
than having to use loops.
Compare to the answers given by Schütze and Manning (pp. 415ff).
You should get the same _absolute_ values, but the signs
may differ. (Why?)
Use np.shape() to check the shapes of U and V_T.
Now multiply U, S, and V_T back together:
X = U * S * V_T
Note that s just gives the singular values, i.e. the diagonal S.
To reconstitute S as a diagonal matrix, you need to do:
S_ = np.diagflat(s)
Then, to get the shape right, you need to add on all-0 rows or
columns. The following code will do this _for this particular
example_, but whether you need additional rows or columns,
and how many, will vary between problems.
empty_col = np.array([[0.0] * 5]).transpose()
S = np.concatenate((S_, empty_col), 1)
Then use np.dot() to do the matrix multiplication (one matrix
pair at a time).
Check that the output of U * S * V_T is the same as the
original input (again, allowing for rounding error).
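Putting the pieces together, here is a sketch of the full round trip.
It uses np.array rather than np.matrix, and the zero-padding is
hard-coded for this particular 5x6 example.

```python
import numpy as np
import scipy.linalg as la

X = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)

U, s, V_T = la.svd(X)          # U: 5x5, s: 5 singular values, V_T: 6x6

# Rebuild the diagonal S, then pad with a zero column to make it 5x6.
S_ = np.diagflat(s)
empty_col = np.array([[0.0] * 5]).transpose()
S = np.concatenate((S_, empty_col), 1)

# Multiply back together, one matrix pair at a time.
X_rec = np.dot(np.dot(U, S), V_T)
print(np.allclose(X, X_rec))   # True, up to rounding error
```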
Question 3: Dimensionality reduction
====================================
Take your code for Question 2, but add in dimensionality
reduction. Allow the user to specify a dimensionality $k$
(here, $k \leq 5$; why?). Set the appropriate singular values
in $s$ to zero to perform the dimensionality reduction (which
are these?)
Shrink the dimensionality to 4, 3, 2, 1, and at each
reduced dimensionality calculate:
X_K = U * S_K * V_T
How quickly does X_K diverge from X? Does this relate
to the size of the singular values?
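A sketch of that loop, recomputing the SVD from Question 2's matrix.
The Frobenius norm is used here to measure divergence; that is one
reasonable choice, not the only one.

```python
import numpy as np
import scipy.linalg as la

X = np.array([[1, 0, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0, 0],
              [1, 0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0, 1]], dtype=float)
U, s, V_T = la.svd(X)

for k in (4, 3, 2, 1):
    s_k = s.copy()
    s_k[k:] = 0.0              # zero out the smallest singular values
    S_k = np.zeros(X.shape)
    S_k[:len(s), :len(s)] = np.diagflat(s_k)
    X_k = np.dot(np.dot(U, S_k), V_T)
    # Frobenius-norm divergence of the rank-k reconstruction from X.
    err = np.linalg.norm(X - X_k)
    print(k, err)
```

Note that the error at each $k$ equals the square root of the sum of
the squared singular values that were zeroed out, which is one way to
answer the question about the size of the singular values.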
Question 4: LSA
===============
We will use the gensim package for performing LSA (aka
LSI) analysis. See:
http://radimrehurek.com/gensim/
for a description of gensim and how to install it.
Read the gensim tutorial at:
http://radimrehurek.com/gensim/tutorial.html
specifically the sections marked "Corpora and Vector
Spaces" (for understanding the representations used
internally by gensim) and "Topics and Transformations"
(which discusses, very briefly, how to perform the
LSA/LSI transformation).
The file:
http://codalism.com/comp90042/wksh/wk04/code/index.py
contains Python code for creating an indexed corpus
in a format that Gensim can work with. Running this
program as:
python index.py ../../dat/lyrl_tokens_30k.dat idx
will create index files with the prefix "idx". (Of course,
you have to specify the correct path to your copy of the
LYRL 30k dataset.)
Then the code fragment:
import index
from gensim import corpora
index_tag = "idx"  # the prefix used when indexing
term_dict = corpora.Dictionary.load(index.get_dict_fname(index_tag))
corpus = corpora.MmCorpus(index.get_corpus_fname(index_tag))
will load the corpus representation and the dictionary mapping
term ids to term strings. Extend this to perform a full LSA
decomposition on the LYRL 30k collection. Extract the top
20 terms from the 20 topics. Compare the first 6 topics with
the results given in lecture notes. Look at the remaining
14 topics. How many of these appear coherent to you?
How do you interpret the negatively-weighted terms?
Question 5: LSA querying
========================
The next section of the gensim tutorial explains how to run
queries against the LSA index. Implement this on the LYRL
index. Select some queries that you think are under-specified
(that is, where the concepts expressed by the query terms
could also be expressed by other query terms). (You could
use the topics identified above to help guide you in this.)
Run the queries, and retrieve the documents, as word lists.
(The file index.py has a function lyrl_to_docs() which will
load the document word lists as a 2-d array.) How
successful is LSA at "bridging the semantic gap"? Get
some of the documents which have been matched from:
http://www.williamwebber.com/rcv1v2/?docid=46825&user=foo
(see worksheet for week 2 for how this service works).
Do these seem like correct matches for the query?