Exciting updates to the Wikipedia Monthly dataset for November! 🚀
・ Fixed a bug so that infobox leftovers and other wiki markers such as __TOC__ are now removed.
・ New Python package, wikisets (https://pypi.org/project/wikisets): a dataset builder with efficient sampling, so you can seamlessly combine the languages you want for any date (ideal for pretraining data, but works for any purpose).
・ Moved the pipeline to a large server. Costs are much higher, but reliability and predictability are better (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing this month due to shenanigans with the migration, but they should be back in December's update.
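
For anyone curious what removing markers like __TOC__ can look like, here is a minimal sketch. It is not the actual fix from the pipeline: the function name `strip_magic_words` and the regex are assumptions for illustration, and it only covers double-underscore behavior switches (__TOC__, __NOTOC__, and similar), not infobox leftovers.

```python
import re

# Behavior switches ("magic words") are written as __UPPERCASE__,
# e.g. __TOC__, __NOTOC__, __NOEDITSECTION__. Assumed pattern, not the pipeline's.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(text: str) -> str:
    """Remove double-underscore behavior switches and tidy leftover whitespace."""
    cleaned = MAGIC_WORD.sub("", text)
    # Collapse runs of spaces or tabs left where a marker used to be.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Drop trailing spaces at the end of lines.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    # Collapse three or more newlines into a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()
```

Lowercase identifiers such as `__init__` are left alone because the pattern only matches uppercase letters between the underscores.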