[fedora-india] Follow up of talk at freed.in 09, offline-wikipedia available online

Mon Mar 16 19:14:39 UTC 2009

Hello all,
I previously posted about this project on the ILUG-D mailing list
(http://www.mail-archive.com/ilugd@lists.linux-delhi.org/msg23879.html), but
at that time it had a lot of dependencies, which made it hard (rather,
impossible) to install, test and give feedback on.
I presented this project at freed.in 09 (slides and a sample are available at
http://code.google.com/p/offline-wikipedia/downloads/list), and both times
the feedback I got was exciting. It all started with an idea from Stian Haklev
(http://reganmian.net/blog/), who was attempting the same thing and was aiming
to get the entire English Wikipedia onto a DVD that can be distributed and
used freely. So here I am with http://92.243.5.147/offline-wiki, which includes
two compressed archives.

Procedure is simple enough:

   - Extract both of them.
   - Put the exact location of blocks/xml_block/ in the
   offline-wikipedia/page/class_con.py file.
   - Run the server using the command:  $ ./manage.py runserver
   - Now, in a browser, you can access any article by opening the link
   http://localhost:8000/wiki/xyz/ (please remember the trailing '/';
   without it, URL resolution fails with an error).
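
Since the trailing '/' is easy to forget, here is a small sketch of a helper
that builds the article link. It is only an illustration: the name
article_url is made up, and the underscore-for-space convention is an
assumption borrowed from Wikipedia's own URLs, not something the setup is
confirmed to require.

```python
from urllib.parse import quote

# Address of the local development server from the steps above.
BASE_URL = "http://localhost:8000/wiki/"


def article_url(title):
    """Return the local URL for an article title, always ending in '/'."""
    # Assumption: spaces become underscores, as in Wikipedia's own URLs.
    slug = quote(title.strip().replace(" ", "_"))
    return BASE_URL + slug + "/"
```

For example, article_url("xyz") gives the link used in the last step.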

That is how to make it work; now about the contents of the setup:

   - block.tgz contains the xml_blocks folder with some 20k-odd files, which
   are small chunks of the huge XML dump provided by Wikimedia (
   http://download.wikimedia.org/enwiki/).
   - offline-wikipedia.tgz has the Django setup and the CSV files that list
   all the articles present in the XML dump.
   - Some other files, like segregate.py, index.py and index_file.py, which I
   used to create the indexes (both the db and the CSV files). I have tried to
   document my steps, but in case of confusion let me know.
   - Media content, like CSS files, images and logos, I have taken from the
   MediaWiki site itself without making changes.

Major concerns/reactions of the audience/users (future targets):

   - How to keep it updated.
   - How to make it editable.
   - How to manage different categories of articles, and segregate based
   on them to make a refined and better education/learning tool (Rahul Sundram).

Issues that are at hand:

   - From my side: apart from following
   http://users.softlab.ece.ntua.gr/~ttsiod/buildWikipediaOffline.html, I
   tried to make this work via Django. That posed the new problem of writing a
   converter from wiki markup text to HTML, which as of now is not perfect and
   needs improvement to use all the content at hand. Here I am ready to switch
   to any parser irrespective of language (Python/PHP), as long as it is
   better than this one. I avoided PHP because it would need the Apache web
   server, which would be overkill.
   - I am using the Django server; it can be replaced by a simple Python web
   server that can handle CSS and other requests, as I am not using any of the
   other features provided by Django (like MVC).
   - To make access to individual articles fast, I am breaking the huge file
   into small ones, resulting in more than 20k-odd files. If there is any way
   to skip that, a hint, idea or help would be great.
   - Last year's October dump was 4.1G and this March 09 dump is already 4.6G,
   making things more difficult.
   - Adding options for going live, searching for articles and making updates
   readily available.
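
On replacing the Django server with a simple Python web server, a minimal
sketch using only the standard library could look like the following. It is
an illustration, not the project's code: title_from_path, make_server and the
lookup callback are hypothetical names, lookup(title) is assumed to return
the article HTML or None, and serving CSS and images would need an extra
branch that is not shown.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote


def title_from_path(path):
    """Return the article title for '/wiki/<title>/', or None otherwise."""
    parts = path.split("?", 1)[0].strip("/").split("/")
    if len(parts) == 2 and parts[0] == "wiki" and parts[1]:
        return unquote(parts[1])
    return None


def make_server(lookup, port=8000):
    """Build a server; lookup(title) is a hypothetical hook that returns
    the article as an HTML string, or None when it is missing."""

    class WikiHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            title = title_from_path(self.path)
            html = lookup(title) if title is not None else None
            if html is None:
                self.send_error(404, "Article not found")
                return
            body = html.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HTTPServer(("localhost", port), WikiHandler)
```

Calling make_server(lookup).serve_forever() would then answer requests such
as http://localhost:8000/wiki/xyz/ without Django.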
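
On avoiding the 20k-odd small files, one possible approach is to keep a
single XML file and index the byte offset of every article, then seek
straight to it. The sketch below is only an idea, not what the project does
today: build_offset_index and read_article are made-up names, it assumes an
uncompressed dump in which the <page>, <title> and </page> tags each sit on
their own line, and it does not deal with compression, so it trades disk
space for fewer files.

```python
import re

TITLE_RE = re.compile(rb"<title>(.*?)</title>")


def build_offset_index(dump_path):
    """Map each article title to the (offset, length) of its <page> element."""
    index = {}
    start = None   # byte offset of the current <page> line, if inside a page
    title = None   # title of the current page, once seen
    with open(dump_path, "rb") as dump:
        while True:
            pos = dump.tell()
            line = dump.readline()
            if not line:
                break
            if b"<page>" in line:
                start, title = pos, None
            if start is not None and title is None:
                match = TITLE_RE.search(line)
                if match:
                    title = match.group(1).decode("utf-8")
            if b"</page>" in line and start is not None:
                if title is not None:
                    index[title] = (start, dump.tell() - start)
                start = None
    return index


def read_article(dump_path, index, title):
    """Read one article's raw XML by seeking to its recorded offset."""
    offset, length = index[title]
    with open(dump_path, "rb") as dump:
        dump.seek(offset)
        return dump.read(length).decode("utf-8")
```

The index itself could be stored in the existing CSV or db files, so that
looking up an article costs one seek and one read instead of opening a
separate chunk file.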

That is all about the project. Till now I have been in conversation with
Stian and Imran (from AU-KBC research lab) about possible usage and options
(I also talked to rahul, rakesh and shirish at freed), but I am sure there
are lots of options, comments and suggestions among other people which can
help develop this and make it a highly useful project. I would really like to
get some feedback, response, help and guidance, so kindly reply with comments
so that we can get to the best conclusion and result.
The blocks.tgz file is huge, and I know it would be really difficult for
many to download and try it, so there is sample.tar.bz2 on Google Code that
you can try. I will upload one more sample by tomorrow at
http://92.243.5.147/offline-wiki/, which is small enough and at the same
time handy for checking out the present state.

-- 
Regards
Shantanu

PS: till now, pranav, nandeep and emmanuel were the ones who tried it and got
back with feedback, comments and suggestions; I hope this time I get more.