
Same experience here. Been building a classical music database [1] where historical and composer life events are scraped off Wikipedia by asking ChatGPT to extract lists of `[{event, year, location}, ...]` from biographies (rough sketch after the list below).

- Using chatgpt-mini was the only cheap option; it worked well (although I have a feeling it's been getting dumber lately) and made the whole thing virtually free.

- Just extracting the webpage text from the HTML with `BeautifulSoup(html).text` slashes the number of tokens (but can be risky when dealing with complex tables).

- At some point I needed to scrape ~10,000 pages that all share the same format, and it was much more efficient, both speed-wise and price-wise, to give ChatGPT the HTML once, say "write some Python code that extracts the data", and then apply that code to the 10,000 pages (sketch at the bottom of this comment). I'm thinking a very smart GPT-based web parser could do that automatically, with dynamically generated scraping methods.

- Finally, because this article mentions tables: Pandas has a very nice function, `pandas.read_html("https://the-website.com")`, that detects and parses all the tables on a page. But the article does a good job of pointing at websites where this fails because the tables don't use `<table>` elements.
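For the curious, the core of the pipeline is roughly this (a minimal sketch; the model name, prompt and composer page are illustrative placeholders, not my exact code):

```python
import json
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_events(wikipedia_url):
    # Strip the HTML down to plain text first; this slashes the token count
    html = requests.get(wikipedia_url).text
    text = BeautifulSoup(html, "html.parser").text

    # Ask the model for the events as a JSON list
    prompt = (
        "From the biography below, return a JSON list of objects of the form "
        '{"event": ..., "year": ..., "location": ...}. Return only the JSON.\n\n'
        + text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder: any cheap model
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

events = extract_events("https://en.wikipedia.org/wiki/Erik_Satie")
```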

[1] https://github.com/Zulko/composer-timelines
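And here is roughly what the "generate the parser once" trick from the third bullet looks like (again a sketch: the file paths and prompt are made up, and you'd want to review the generated code before exec'ing it):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# Show the model ONE sample page and ask it to write the parser
sample_html = Path("pages/sample.html").read_text()
prompt = (
    "All pages of this site share the same layout. Here is one of them:\n\n"
    + sample_html
    + "\n\nWrite a Python function parse_page(html) that returns a list of "
    "{event, year, location} dicts for such a page. Return only the code."
)
generated_code = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder: any cheap model
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content

# Review generated_code, then reuse it on all 10,000 pages for free
namespace = {}
exec(generated_code, namespace)
records = [
    namespace["parse_page"](path.read_text())
    for path in Path("pages").glob("*.html")
]
```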



If you haven't considered it, you can also use the direct wikitext markup, from which the HTML is derived.

Depending on how you use it, the wikitext may or may not be more ingestible if you're passing it through to an LLM anyway. You may also be able to pare it down a bit by heading/section so that you can reduce it to only the sections that are likely to be relevant (e.g. "Life and career").

You can also download full dumps [0] from Wikipedia and query them via SQL to make your life easier if you're processing a lot of pages.

[0] https://en.wikipedia.org/wiki/Wikipedia:Database_download#Wh...?
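For example, something along these lines fetches a page's raw wikitext from api.php, optionally restricted to a single section (a sketch; the title and section index are placeholders):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def get_wikitext(title, section=None):
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    }
    if section is not None:
        params["section"] = section  # section index; 0 is the lead section
    data = requests.get(API, params=params).json()
    return data["parse"]["wikitext"]

print(get_wikitext("Erik Satie")[:500])
```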


> reduce it to only the sections that are likely to be relevant (e.g. "Life and career")

True, but I also managed to do this from the HTML. I tried getting a page's wikitext through the API but couldn't figure out how.

Just querying the HTML page was less friction and fast enough that I didn't need a dump (although once AI becomes cheap enough, there are probably a lot of things one could do with a full Wikipedia dump!).

One advantage of using online Wikipedia instead of a dump is that I have a pipeline on GitHub Actions where I just enter a composer name and it automagically scrapes the web and adds the composer to the database (it takes exactly one minute from the click of a button!).


Wikipedia's api.php supports JSON output, which probably already helps quite a bit. For example https://en.wikipedia.org/w/api.php?action=query&prop=extract...
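Something like this, for instance, returns the plain-text extract of a page (sketch; the title is a placeholder):

```python
import requests

params = {
    "action": "query",
    "prop": "extracts",
    "explaintext": 1,        # plain text rather than HTML
    "titles": "Erik Satie",  # placeholder title
    "format": "json",
    "formatversion": 2,
}
data = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
print(data["query"]["pages"][0]["extract"][:500])
```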


Oooh, I had missed that, thanks!


This doesn't directly address your issue, but since it caused me some pain I'll share it: if you want to parse structured information from Wikipedia infoboxes, the npm module wtf_wikipedia works.



