Friday, November 5, 2010

FOAF, crawlers, etc

 

As part of my research, I have to gather a large number of FOAF profiles. So I looked for an existing crawler able to fetch them, rather than writing one myself, which is tedious given the flexibility of RDF and, consequently, the number of outbound links that could be meaningful.

 

So I went looking for FOAF crawlers (semantic web crawlers, also called scutters). The first one I found is Futil (funny name choice, BTW). The project page gives some rather generic information about the project and links to the web-based SVN browser. I did not find a reference to the actual SVN repository URL, but I could easily infer it. OK, so far so good.
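For reference, the core of a scutter is small: fetch a profile, collect the rdfs:seeAlso links it contains, and queue the ones not seen yet. What follows is only a rough sketch under loud assumptions, not how Futil or any other crawler actually works: it handles RDF/XML only, it looks only at rdfs:seeAlso elements carrying an rdf:resource attribute, it does not resolve relative URIs, and the names see_also_links, crawl and fetch are made up for illustration. The fetch argument is any function that takes a URL and returns the document text.

```python
import xml.etree.ElementTree as ET
from collections import deque

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS_NS = "http://www.w3.org/2000/01/rdf-schema#"


def see_also_links(rdf_xml):
    """Return the rdf:resource target of every rdfs:seeAlso element."""
    root = ET.fromstring(rdf_xml)
    links = []
    for element in root.iter("{%s}seeAlso" % RDFS_NS):
        target = element.get("{%s}resource" % RDF_NS)
        if target:
            links.append(target)
    return links


def crawl(seed, fetch, limit=100):
    """Breadth-first walk starting at seed; returns the URLs fetched.

    fetch(url) is a hypothetical callable returning RDF/XML text.
    """
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        try:
            document = fetch(url)
        except Exception:
            continue  # unreachable profile: skip it
        visited.append(url)
        try:
            links = see_also_links(document)
        except ET.ParseError:
            continue  # not well-formed RDF/XML: nothing to follow
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A real crawler would plug in an HTTP client as fetch and a proper RDF parser instead of the element scan, but the queue-and-seen-set loop stays the same.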

So I examined the list of dependencies the project has:

  • Python >= 2.4.0
  • RDFLib >= 2.3.1
  • PyLucene >= 2.0.0
  • PySQLite >= 2.3.2
  • Python-MySQLdb >= 1.2.1
  • PyXML >= 0.8.4
  • Python-URL >= 1.0.1

The first dependency is obvious: it's a Python project, so we need a Python environment, and 2.4 is a very low requirement. Good. RDFLib can be easily installed with easy_install, even on Windows. Really, easy_install RDFLib just works!

OK. Then it comes to PyLucene: I found it quite impossible to install on Windows. I'm sure I could work it out, but the instructions are terrible. There is a binary build project; however, it does not work. At least, not with the simple instructions they give. I installed the egg and then:

C:\Users\enrico>python
Python 2.6.5 (r265:79096, Mar 19 2010, 21:48:26) [MSC v.1500 32 bit (Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import lucene
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\site-packages\lucene-2.9.3-py2.6-win32.egg\lucene\__init__.py", line 2, in <module>
    import os, _lucene
ImportError: DLL load failed: The specified module could not be found.

OK. Things are getting nasty, and for no reason. Compiling Python modules on Windows is usually a PITA, especially because I have Visual Studio 2010 and not Visual Studio 2008, which is the compiler the official Python 2.6 Windows build uses (the MSC v.1500 in the banner above). Given this little "compile it yourself" DLL hell, I'm not going down that route (moreover, I'm a gcc guy).

So I tried the Futil package without PyLucene. Maybe, I thought, something works even without it. After all, Python is a very dynamic language, and it is trivial to write a program or library which does not load things it does not use. So I tried just:

 

python futil-run.py --help
What I got was an error about MySQLdb not being installed. What the fuck! I asked for a help message. Nothing more! I need no stinking MySQL! Just a bloody help message. Consequently, I suppose that the whole project needs every dependency installed, even the ones I don't use. Which is bad.
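To show what I mean by not loading things you do not use, here is a minimal sketch of a command-line entry point that defers the import of an optional backend until it is actually needed. This is not Futil's code: the program name crawler-sketch, the --backend option and the load_backend helper are all invented for illustration, and the default backend is just the standard library's sqlite3 so that the example runs anywhere.

```python
import argparse
import importlib


def load_backend(name):
    """Import a storage backend by module name, only when requested."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        # Fail with a readable message instead of a bare traceback.
        raise SystemExit("backend %r is not installed: %s" % (name, exc))


def build_parser():
    # Hypothetical options; nothing here imports an optional dependency.
    parser = argparse.ArgumentParser(prog="crawler-sketch")
    parser.add_argument("--backend", default="sqlite3",
                        help="module name of the storage backend to import")
    return parser


def main(argv=None):
    # --help is handled inside parse_args, which exits before any
    # backend import is attempted.
    args = build_parser().parse_args(argv)
    backend = load_backend(args.backend)
    print("using backend:", backend.__name__)
    return backend
```

Calling main(["--help"]) prints the usage text and exits without touching any backend, while main(["--backend", "MySQLdb"]) would try the import and complain only then if the module is missing.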

 

I suppose I'll try Slug, which is written in Java, and see what happens.
