This file is a list of word associations extracted from the 88-million-sentence
ukWaC corpus, which is available for download at:
http://wacky.sslmit.unibo.it/doku.php?id=corpora

This file was generated using a simple word co-occurrence counting algorithm
developed as part of a PyCon 2010 talk, which is available to watch here:
http://pycon.blip.tv/file/3259632/
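
For readers who have not seen the talk, here is a rough sketch of what
sentence-level co-occurrence counting can look like. This is NOT the code
from the talk: the function name, the whitespace tokenization and the choice
of the sentence as the counting window are all assumptions made purely for
illustration.

```python
from collections import Counter, defaultdict
from itertools import permutations


def count_cooccurrences(sentences):
    """Count how often each ordered pair of words shares a sentence.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, which is an assumption and not necessarily what the talk used.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        # Pair every word position with every other position in the
        # same sentence, in both directions.
        for first, second in permutations(words, 2):
            counts[first][second] += 1
    return counts


toy_sentences = ["bread and butter", "bread with butter"]
toy_counts = count_cooccurrences(toy_sentences)
```

Note that under this scheme a word that appears twice in one sentence is
counted as co-occurring with itself.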

If you just want the slides, they are available at:
http://us.pycon.org/2010/conference/schedule/event/98/

The last couple of slides point to a PDF that explains how to run the
hadoopified algorithm and where the code can be downloaded. However, if
you just want the raw output of the algorithm, it can be downloaded at:
http://www.umiacs.umd.edu/~nmadnani/pycon/results.zip

Please note that the word list contains several entries that might
not really seem like words. The corpus consists of webpages in the .uk
domain, so there are lots of neologisms, misspellings and just plain
crazy words. You might need to do a bit of pre-processing (like
intersecting with a reasonable dictionary) to use this list in
an application.
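
The dictionary intersection mentioned above could be sketched as follows.
This assumes the word list has already been loaded into a Python dict that
maps each stimulus word to its response word, and that the dictionary is a
set of lowercase words. Both are assumptions for illustration, as is the
function name; loading the actual files is left out.

```python
def filter_by_dictionary(associations, dictionary):
    """Keep only pairs where both words appear in the dictionary.

    `associations` is assumed to map stimulus -> response, and
    `dictionary` is assumed to be a set of known-good lowercase words.
    """
    return {
        stimulus: response
        for stimulus, response in associations.items()
        if stimulus in dictionary and response in dictionary
    }


# Made-up toy data, not taken from the real word list.
toy_associations = {"bread": "butter", "teh": "the", "salt": "peppr"}
toy_dictionary = {"bread", "butter", "salt", "pepper", "the"}
cleaned = filter_by_dictionary(toy_associations, toy_dictionary)
```

Requiring both words to be in the dictionary is the strict option. Checking
only the stimulus would keep more entries but let odd responses through.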

Another important note: I found that sometimes the word "best associated"
with a stimulus word was the word itself. This is simply an artefact
of using word co-occurrence counting to represent word association. 
In an actual psycholinguistics experiment on word association, the
response will never be the same as the stimulus. Therefore, whenever
I encountered this situation, I chose the second-best word association
as the answer.
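
The rule above can be sketched in a few lines. This is only an illustration
and not necessarily how extract_best.py does it: the function name is made
up, and the co-occurrence counts for a stimulus are assumed to be available
as a collections.Counter.

```python
from collections import Counter


def best_association(stimulus, counts):
    """Return the highest-count response that is not the stimulus itself.

    `counts` is assumed to be a Counter mapping candidate response words
    to their co-occurrence counts with `stimulus`.
    """
    for word, _ in counts.most_common():
        if word != stimulus:
            return word
    # No candidate other than the stimulus itself.
    return None


# Made-up counts in which the stimulus is its own top associate.
toy = Counter({"bread": 10, "butter": 7, "jam": 3})
answer = best_association("bread", toy)
```

When the stimulus is the top-ranked word, this returns the second-best
association. Otherwise it simply returns the best one.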

If you want to generate the word list from the results yourself, 
you will need this script:
http://www.umiacs.umd.edu/~nmadnani/pycon/extract_best.py

If you have any questions or comments, please feel free to email me at:
nmadnani@umiacs.umd.edu.


