This is an interesting idea. When Wikimedia finalizes an incremental backup solution, it may be possible. They'll release a dump with incremental additions / updates / deletions. You would then have XOWA accept the additions / updates, but ignore all the deletions.
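To make the idea concrete, here's a toy sketch of applying an incremental changeset while ignoring deletions. Everything here is made up for illustration (the record format, operation names, and function are mine, not anything Wikimedia or XOWA actually ships):

```python
# Toy sketch: apply an incremental dump to a local article store,
# accepting additions/updates but skipping deletions.
# (Record format and names are purely illustrative.)

def apply_incremental(local, changes):
    """local: dict of title -> text; changes: list of (op, title, text)."""
    for op, title, text in changes:
        if op in ("add", "update"):
            local[title] = text
        elif op == "delete":
            pass  # keep the local copy; ignore upstream deletions
    return local

articles = {"Earth": "old text"}
changes = [
    ("update", "Earth", "new text"),
    ("add", "Moon", "moon text"),
    ("delete", "Earth", None),
]
apply_incremental(articles, changes)
# "Earth" keeps its updated text, "Moon" is added, nothing is removed
```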
It would place more responsibility on the user to maintain their copy of the dump though.
That's true. The burden then shifts to the user. But in a way, that's also good because then you can choose which snapshot to follow.
It's a bit like maintaining your copy of an OS. You can stick to the "stable" branch or, if you're feeling adventurous, you can switch to "release". If you're really into the bleeding edge, you can go with the "nightly" build.
All-in-all, I really like this.
One concern I have is the possible increased bandwidth load for WP. Maybe you could include a small icon or notification encouraging users to support it with donations. Couldn't hurt to have one there for yourself as well.
Source control is an interesting analogy. In the same vein, when a user syncs their version with the main branch, there will be hundreds of thousands of changes to review. It'll be pretty harrowing for anyone to figure out what to keep / reject. Just something to consider.
Anyway, thanks for the food for thought as well as your suggestion. I added donation links for archive.org and wikipedia tonight.
Thanks for giving it a try. Kiwix is definitely more polished in UI, especially as it has been around for 5+ years. I'd like to think that though XOWA isn't as friendly UI wise, it offers a lot more power / options.
Regarding images: there is some assembly required, but I tried to make the instructions as simple as possible. If you look at http://xowa.sourceforge.net/setup_simplewiki.html, there should be two steps:
* Download the .7z file from archive.org: http://archive.org/details/Xowa_simplewiki_2013-10-30_images...
* Unzip the .7z file to your XOWA directory. If you're on Windows and have C:\xowa as your folder, you should get a file called C:\xowa\file\simple.wikipedia.org\fsdb.main\fsdb.abc.sqlite3 as well as many others
enwiki is a little more difficult, but only in that it requires downloading more files.
Let me know if you run into other issues. I'm going off to work now, but I'll check again later.
EDIT: I forgot to add that if you set up ImageMagick and Inkscape (installation instructions are on XOWA's Main_Page), you can download images dynamically for each article (i.e.: you don't need to download the entire image dump first)
Thanks for your reply. I did see the things you mention regarding images, but the gap is that I'm exporting a private MediaWiki, not one of the well-known wikis that you have added explicit support for.
I tried tar'ing up my images directory from the server and unpacking it in a few likely-looking locations on the filesystem, but that didn't work. The filesystem layout was kind of confusing with the "user" and "wiki" separation.
How would one prepare a similar image database for an unsupported wiki? I expect this is a custom thing you prepared as opposed to the xml text dump which is a standard mediawiki dump format.
As to the ImageMagick part, it doesn't work for an unsupported wiki. Also, it would be impractical for me to manually crawl my whole site triggering downloads of images, and even if I did that, it is unclear how to package and deploy the result. The deployment needs to be completely offline because there is no Internet at the prison.
Overall, setting up one of the well-known wikis is probably pretty smooth, but a private wiki requires a lot of technical knowledge about implementation details, which makes this tool impractical for unskilled users. Right now, the Kiwix deployment is close to ideal: I just need to instruct the unskilled user to replace the ZIM file.
There is one small deficiency in the Kiwix deployment: the automatic index files are user-specific unless prepared in advance and recorded in the library.xml file, so in practice I had to prepare a script to make sure the index and library were right. The actual deployment is "copy ZIM files to this dir, then double-click on this script".
Hey. I just happened to check this thread and saw your response.
To answer your question, yes: the image databases were prepared with the expectation of a standard Wikimedia wiki. These wikis have a standard file layout of wikipedia/wikidomain/thumb/hash0/hash01/name_of_file/thumbnail_file; for example: wikipedia/commons/thumb/9/97/The_Earth_seen_from_Apollo_17.jpg/270px-The_Earth_seen_from_Apollo_17.jpg.
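For reference, that hash0/hash01 part of the layout comes from MediaWiki's hashed upload directories: the two directory levels are the first one and the first two hex digits of the MD5 of the filename (with spaces replaced by underscores). A quick sketch, assuming that standard scheme (the function name is mine):

```python
import hashlib

def wmf_thumb_path(domain, filename, width):
    # MediaWiki's hashed upload layout: the two directory levels are the
    # first one and first two hex digits of the MD5 of the underscored
    # filename (the $wgHashedUploadDirectory scheme).
    name = filename.replace(" ", "_")
    h = hashlib.md5(name.encode("utf-8")).hexdigest()
    return "{0}/thumb/{1}/{2}/{3}/{4}px-{3}".format(domain, h[0], h[:2], name, width)

print(wmf_thumb_path("commons", "The Earth seen from Apollo 17.jpg", 270))
```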
If you're using a MediaWiki installation, your files should be laid out similarly. You can change the XOWA config file to explicitly specify this WMF layout. XOWA allows the user to work directly with the WMF tarballs, so this should work for you as well. You can look at this thread for another user's attempts: https://sourceforge.net/p/xowa/discussion/general_archived/t... If you have questions, feel free to ask / post.
The other alternative is for XOWA to have the ability to read from a non-Wikimedia directory. Another user asked for this for his own private wiki: https://sourceforge.net/p/xowa/tickets/159/. In this scenario, you'd have all your files in some root directory (e.g., C:\images) and XOWA would index the directory and look up each file by filename. You would probably need ImageMagick and Inkscape installed though.
Regarding your other point: I will probably centralize all the directories instead of spreading them out between /wiki/, /file/, and /user/. I had a reason for this layout, but it's causing confusion among a few users. You could always zip the files with relative paths and instruct the users to unzip it. For example, the XOWA wikiquote package is one zip file: https://archive.org/details/Xowa_enwikiquote_2013-11-19_comp.... If you unzip it in the /xowa/ dir, it will automatically put all files into the relevant folders.
In the end, if you have a routine set up for Kiwix, you're probably best sticking with it. Keep in mind that XOWA does offer some other nice features that you may or may not need (editable wiki pages; Wikimedia Lua code). It also offers a lot of customization. For example, one user added MathJax to XOWA on his own (he then proceeded to add a lot more: sorting / collapsing, a Wikidata skin, redlinks, etc.).
Let me know if you're interested, and I'll see what I can do to help. Otherwise, thanks for the use case scenario. It's definitely something I'll consider supporting in the future!
That's pretty impressive. I never had the patience to sit through a full MediaWiki import for en.wikipedia.org.
Just to be clear, XOWA isn't an installer for MediaWiki, but its own app. This allows it to avoid the dependency on the entire MediaWiki tool-chain (Apache, PHP, MySQL, MediaWiki). Unfortunately, this means that XOWA has to reproduce the same logic, which is quite a challenge...
It is indeed a challenge. The MediaWiki syntax is the weirdest mess I have ever had to parse. There is no spec, real-world usage deviates significantly from the help docs, and it's a Turing-complete language with heaps of backwards-compatibility hacks. So if you have something reasonably complete and correct, then kudos to you!
Thanks. The syntax was challenging, especially all the template syntax ("{{my_template|{{{argument1|defaultvalue|{{nested_template}}}}}}}"). Fortunately, the new Lua module should eventually replace the template syntax, which should make things easier for future parsers.
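For the curious, even a toy parser shows the pain: a brace-counting scanner can pull out top-level {{...}} spans, but runs of three or more braces are ambiguous between template ({{ }}) and argument ({{{ }}}) syntax, so a real parser needs longest-match and fallback rules on top of this. A sketch of just the nesting part (purely illustrative, not XOWA's parser):

```python
def top_level_templates(text):
    """Return top-level {{...}} spans, handling nesting by counting braces.
    Deliberately naive: it does not distinguish {{{param}}} from {{template}},
    which is exactly where real wikitext parsing gets hard."""
    spans, depth, start, i = [], 0, None, 0
    while i < len(text) - 1:
        if text[i:i + 2] == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif text[i:i + 2] == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start:i + 2])
            i += 2
        else:
            i += 1
    return spans

print(top_level_templates("a {{foo|{{bar}}}} b {{baz}}"))
# → ['{{foo|{{bar}}}}', '{{baz}}']
```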
Yes, this would be the ideal approach, but it can become quite complicated (because the tool-chain needs to be installed on different machines). In addition, the official XML importer (importDump.php) is not really up to the task (it's slow and sometimes buggy).
If you're interested in going this route, you can look at http://www.nongnu.org/wp-mirror/. This should build a local MediaWiki instance with one click. Keep in mind that it's a bit slow: it takes two days to build simple.wikipedia.org with images. In contrast, XOWA sets this up in about 30 minutes.
However, it suffers from the number one torrent issue: torrents do not tolerate change. This means that:
- When an article changes, you need to generate a new torrent
- When a new torrent is released for the archive, it needs to be downloaded from scratch by all peers so that the maximum number of peers is available for a newcomer.
I hope you'll understand that this is not the official way to distribute archives...
Kiwix's Android app and the full-text search are both great features.
However, I'll point out that Kiwix has not updated English Wikipedia since January 2012. Also, XOWA works directly with the Wikimedia dumps (http://dumps.wikimedia.org/backup-index.html), so it (a) is always up to date and (b) can work on any wiki (Kiwix needs to release the ZIM file first).
Also, XOWA can run from an external SD card (including FAT32-formatted ones).
Another Android option is Fastwiki [1]. It has no images, but it provides a tool to convert native Wikimedia dumps, and it works with older Android versions.
Unfortunately, having your own copy of Wikipedia could also be used to enable censorship. For example, a fundamentalist school could have their own version of Wikipedia from which they've purged all articles about evolution, etc. Then they could configure their firewall to block the real Wikipedia.
Agreed. However, I think it would be less work for them to block access through firewall policy than to remove the articles from XOWA.
By and large, for most private individuals, an offline app would allow them to evade censorship. I'd hope that this benefit outweighs the risk of such abuse.
A firewall would not hide the fact that censorship takes place. You would have to rewrite content to do that. That might be easier in batch, especially if you are going to use NLP to make the cut-up sentences grammatical.
Ahh.... That's pretty devious. I was thinking of blocking the entire article, not rewriting content. Still not worth the work IMHO, but who knows what censorship servants would do.
You can try the low-space import. There are instructions in XOWA at home/wiki/Help:Options/Import. It takes longer (8-10 hours) but only needs 35-40 GB (which is still a lot).
In the end, you're still going to need about 25 GB for English Wikipedia. If you want something smaller, you can try one of the other wikis (for example, Wiktionary, Wikiquote, Wikisource, etc.). Each of these is generally about 5 GB.