Manual

Todo: This is (at least partially) a first person style text, probably because it has been imported from Scytale’s import.py documentation. It should be rewritten to reflect that this is now a community project.

Software requirements

  • Python 2.5, although 2.6 is recommended
  • a reasonably current Git installation (1.5 or higher)
  • Unix OS strongly recommended for performance reasons (but Windows users, please report back how painful it was!)

User requirements

  • basic to intermediate knowledge of Git
  • VERY much patience!

Hardware requirements

  • lots of disk space
  • 1 GB of RAM or more
  • the faster the CPU, the better (dual core recommended)

Actual requirements will differ depending on the size of the data to import.

Since the code is currently being restructured a bit, we don’t have current numbers for speed and disk usage. As a rule of thumb: You should be able to import a “smaller” Wikipedia (like {pdc,bar,vo}.wikipedia.org) on your laptop.

RAM

Currently, the dump is imported one revision at a time. RAM usage for import.py therefore is well below 1 GB and severe leaks in the Python code have not been found yet.

However, git-fast-import keeps track of a lot of stuff in the background, so you should have several GB of RAM available for the actual git-fast-import process (about 4-6 GB for dewiki, according to one of the testers).

Once your system starts swapping, data throughput will drop remarkably fast, so try to avoid memory shortage.

Hint: If you’re in the middle of a thrashing, unresponsive system, don’t Ctrl-C Python, but kill it using its PID. That’s way faster.

Disk

First of all, there are four distinct places on your mass storage involved in the import process:

  • (a) the input file
  • (b) the output repo
  • (c) Levitation’s metadata files
  • (d) your swap file

If you can, place these on separate physical disks to improve speed. If you don’t have that many disks, at least try to separate (b) and (c). Keeping (d) separate isn’t too important: you’re probably lost anyway once your machine starts swapping.

Where (b) is placed is determined by the GIT_DIR variable you pass to git-fast-import (see the invocation under “Basic usage” below). The (c) files are by default placed as “.import-*” in the current working directory, so cd to where you want them first or use the -{M,C,U,P} parameters to put them somewhere else.

To allow fast random access to the metadata files, Levitation uses a binary format with fixed-length data sets. The file size is therefore determined by the product of entry_length * max_value, where max_value is the highest ID of some kind that appears in the input file. The factors for each meta file are:

meta:  17 bytes per revision  (linking the three other files together)
comm: 257 bytes per revision  (storing “edit summary” lines)
user: 257 bytes per user      (storing the user name)
page: 257 bytes per page      (storing the page title)
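
To illustrate why fixed-length entries allow fast random access, here is a minimal sketch in present-day Python 3 (not the Python 2 that import.py targets). The names write_record and read_record are hypothetical, and the slot layout (one length byte followed by 256 padded data bytes) is only an assumption chosen to add up to 257 bytes; Levitation’s real on-disk layout may differ.

```python
import io

RECORD_LENGTH = 257  # bytes per entry, as listed for the comm/user/page files


def write_record(f, record_id, text):
    """Store text in the fixed-length slot that belongs to record_id."""
    # Assumed layout: 1 length byte + 256 zero-padded data bytes.
    data = text.encode("utf-8")[:255]
    slot = bytes([len(data)]) + data.ljust(RECORD_LENGTH - 1, b"\0")
    # The slot position follows directly from the ID, so no index is needed.
    f.seek(record_id * RECORD_LENGTH)
    f.write(slot)


def read_record(f, record_id):
    """Return the text stored for record_id, or "" if the slot is empty."""
    f.seek(record_id * RECORD_LENGTH)
    slot = f.read(RECORD_LENGTH)
    if len(slot) < RECORD_LENGTH:
        return ""  # slot lies beyond the end of the file
    length = slot[0]
    # errors="ignore" drops a multi-byte character cut off by truncation.
    return slot[1:1 + length].decode("utf-8", errors="ignore")
```

Writing a single entry with ID 1701 into an empty file already makes the file cover every slot up to that ID, which is why the highest ID, and not the number of entries, drives the file size. Slots start at ID 0 in this sketch, so the file ends up one slot longer than the entry_length * max_value rule of thumb.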

Again, it doesn’t matter how many revisions, users or pages the file contains; only their maximum ID counts. When importing a MediaWiki, you can read the revision, user and page counts on its “Special:Statistics” page, but treat them as a rough lower bound, since the highest ID can be larger than the count (for example after deletions). Alternatively, you may use getmaxids.sh to retrieve the actual values from your XML dump.
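
getmaxids.sh itself is not reproduced here. As a rough illustration of what retrieving these values involves, the following sketch streams through a dump using only the Python 3 standard library. It assumes the usual MediaWiki export layout (an id element directly inside page, revision and contributor); max_ids is a hypothetical name and not part of Levitation.

```python
import io
import xml.etree.ElementTree as ET

# Which parent element an <id> belongs to decides which maximum it updates.
PARENT_TO_KEY = {"page": "page", "revision": "revision", "contributor": "user"}


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def max_ids(stream):
    """Return the highest page, revision and user ID found in an XML dump."""
    maxima = {"page": 0, "revision": 0, "user": 0}
    stack = []  # local names of the currently open elements
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            stack.append(local_name(elem.tag))
            continue
        name = stack.pop()
        if name == "id" and stack and elem.text:
            key = PARENT_TO_KEY.get(stack[-1])
            if key is not None:
                maxima[key] = max(maxima[key], int(elem.text))
        if name == "page":
            elem.clear()  # drop the page's revisions to keep memory usage low
    return maxima
```

For a compressed dump, the stream argument can be any binary file object, for instance one returned by bz2.open(path, "rb").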

Example: The highest page ID in your dump is 31337, the highest revision ID is 424242 and the highest user ID is 1701. Then, the meta files will need:

       meta            comm            user            page
  (17 * 424242) + (257 * 424242) + (257 * 1701) + (257 * 31337)
=    7212114    +    109030194   +    437157    +    8053609
= 124733074 bytes (~119 MiB)
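
The same arithmetic can be scripted so you can plug in your own maximum IDs. This is a small Python 3 sketch; the function name metadata_sizes is made up for illustration, while the per-entry byte counts are the ones listed above.

```python
# Bytes per entry for each metadata file, as listed in the table above.
ENTRY_BYTES = {"meta": 17, "comm": 257, "user": 257, "page": 257}


def metadata_sizes(max_revision_id, max_user_id, max_page_id):
    """Estimate the size in bytes of each metadata file from the highest IDs."""
    return {
        "meta": ENTRY_BYTES["meta"] * max_revision_id,
        "comm": ENTRY_BYTES["comm"] * max_revision_id,
        "user": ENTRY_BYTES["user"] * max_user_id,
        "page": ENTRY_BYTES["page"] * max_page_id,
    }


if __name__ == "__main__":
    sizes = metadata_sizes(max_revision_id=424242, max_user_id=1701,
                           max_page_id=31337)
    total = sum(sizes.values())
    print(sizes)
    print("%d bytes (%.1f MiB)" % (total, total / 2.0 ** 20))
```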

Note that these are just the metadata requirements. These files can be deleted after importing. Not deleting them does not improve the speed of future runs, as they are recreated on every invocation.

The Git repository itself usually needs a lot more space. As a safe starting point, plan for at least(!) the size of the bz2-compressed dump. The actual size will most likely end up somewhere between 50% and 300% of that value. Yes, that’s pretty vague.

Also, note that the invocation demoed below creates a bare repository (i.e. one without a checked out working tree). You generally want it like that. When you decide to check out the finished repo, make sure the place where you check it out has enough free space and (probably more important) enough free inodes.

About import.py

Currently, Levitation’s main tool is import.py, which is essentially a filter: it takes the XML dump data on stdin and outputs a data stream suitable for consumption by git-fast-import. You may use all kinds of Unix magic to produce a valid input stream; for example, you can decompress a bz2-compressed file on the fly while reading. All of the following examples are therefore just that: examples. They show some common usage scenarios; your own command line may vary.
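
In a shell, putting a tool such as bzcat in front of the pipe is the usual way to decompress on the fly. If you would rather stay in Python, the same idea can be sketched with the standard library (Python 3 here, not the Python 2 that import.py targets). The function name decompress_to is hypothetical; the script simply writes the decompressed dump to stdout so it can be piped into import.py.

```python
import bz2
import shutil
import sys


def decompress_to(path, out, chunk_size=1024 * 1024):
    """Stream a bz2-compressed file to a binary file object, chunk by chunk."""
    with bz2.open(path, "rb") as compressed:
        # copyfileobj never holds more than one chunk in memory.
        shutil.copyfileobj(compressed, out, chunk_size)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Write raw bytes, bypassing the text layer of sys.stdout.
    decompress_to(sys.argv[1], sys.stdout.buffer)
```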

You may import dumps from any MediaWiki in the world, but most of you will probably want to play around with Wikipedia dumps. You can get recent ones either at our very own dump service (which we recommend), or at the Wikimedia dump service.

The pages-meta-history.xml file (with whatever compression) is what we want. It includes all pages in all namespaces and all of their revisions.

Hint: In case you’re wondering: Wikimedia does not offer content SQL dumps anymore, and there are no full-history dumps for en.wikipedia.org because of its size.

Alternatively, you may use a MediaWiki’s Special:Export page to create an XML dump of certain pages.

Basic usage

Execute import.py --help to see all available options.

Hint: Please note that there’s the --max setting which specifies how many pages to import. By default it’s set to 100, so Levitation will only import 100 pages, not more. This protects you from filling your disk when you’re too impatient. ;) Set it to -1 when you’re ready for a “real” run.

This will import the pdc.wikipedia.org dump into a new Git repository “repo”:

 rm -rf repo; git init --bare repo && \
   ./import.py -m -1 < ~/pdcwiki-20091103-pages-meta-history.xml | \
   GIT_DIR=repo git fast-import | \
   sed 's/^progress //'

To find out which part of the import is slow, have a look at the Profiling section.

Working features

  • Read a Wikipedia XML full-history dump and output it in a format suitable for piping into git-fast-import. The resulting repository contains one file per page.
  • Use the original modification summary as commit message.
  • Read the Wiki URL from the XML file and set user mail addresses accordingly.
  • Use the author name in the commit instead of the user ID.
  • Store additional information in the commit message that specifies page and revision ID as well as whether the edit was marked as “minor”.
  • Use the page’s name as file name instead of the page ID. Non-ASCII characters and some ASCII ones will be replaced by “.XX”, where .XX is their hex value.
  • Put pages in namespace-based subdirectories.
  • Put pages in a configurably deep subdirectory hierarchy.
  • Use command line options instead of hard-coded magic behavior. Thanks to stettberger for adding this.
  • Use a locally timezoned timestamp for the commit date instead of a UTC one.
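
The “.XX” file name escaping mentioned in the list above can be sketched as follows (Python 3). This is not Levitation’s actual code: the function name escape_title, the set of characters left untouched, the use of UTF-8 bytes and the upper-case hex digits are all assumptions made for the sake of the example.

```python
import string

# Assumption: only ASCII letters, digits, "_" and "-" are kept as they are.
SAFE_BYTES = frozenset(
    (string.ascii_letters + string.digits + "_-").encode("ascii")
)


def escape_title(title):
    """Replace every byte outside SAFE_BYTES with '.XX' (its hex value)."""
    parts = []
    for byte in title.encode("utf-8"):
        if byte in SAFE_BYTES:
            parts.append(chr(byte))
        else:
            parts.append(".%02X" % byte)
    return "".join(parts)
```

Because the dot itself is escaped as well (it becomes “.2E”), a name produced this way can be decoded back to the original title without ambiguity.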

Missing features

  • Allow IPv6 addresses as IP edit usernames. (Although apparently MediaWiki itself cannot handle IPv6 addresses, so we got some time.)

Cool stuff

  • git checkout master~30000 takes you back 30,000 edits in time — in about a second.
  • The XML data might be in the wrong order to directly create commits from it, but it is in the right order for blob delta compression: When passing blobs to git-fast-import, delta compression will be tried based on the previous blob — which is the same page, one revision before. Therefore, delta compression will succeed and save you tons of storage.
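
To make the blob ordering above more concrete, here is a simplified Python 3 sketch that produces a git-fast-import stream: all blobs are emitted first in dump order (page by page, which is what helps delta compression), and the commits follow sorted by timestamp. It illustrates the stream format only and is not how import.py is implemented; build_stream, the tuple layout and the “@wiki.invalid” mail domain are hypothetical.

```python
def blob(mark, content):
    """One 'blob' command; the mark lets later commits refer to this blob."""
    return b"blob\nmark :%d\ndata %d\n%s\n" % (mark, len(content), content)


def commit(ref, author, email, when, message, path, blob_mark):
    """One 'commit' command that sets a single file to a previously sent blob."""
    msg = message.encode("utf-8")
    lines = [
        b"commit " + ref.encode("ascii"),
        ("committer %s <%s> %d +0000" % (author, email, when)).encode("utf-8"),
        b"data %d" % len(msg),
        msg,
        ("M 100644 :%d %s" % (blob_mark, path)).encode("utf-8"),
        b"",
    ]
    return b"\n".join(lines) + b"\n"


def build_stream(revisions):
    """revisions: (path, unix_time, author, summary, text) tuples in dump order."""
    out = []
    # Pass 1: blobs in dump order, so each blob follows its predecessor
    # revision of the same page.
    for mark, rev in enumerate(revisions, start=1):
        out.append(blob(mark, rev[4].encode("utf-8")))
    # Pass 2: commits in chronological order, each pointing at its blob mark.
    for i in sorted(range(len(revisions)), key=lambda i: revisions[i][1]):
        path, when, author, summary, _text = revisions[i]
        out.append(commit("refs/heads/master", author,
                          author + "@wiki.invalid", when, summary,
                          path, i + 1))
    return b"".join(out)
```

The resulting bytes could be written to the stdin of git fast-import in the same way as the output of import.py in the invocation shown under “Basic usage”.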

Contact

There are several ways to get in touch with us. If you have any questions, please make sure to read the FAQ first.

IRC

Currently, most communication takes place in #levitation on the freenode IRC network.

Mailing List

Our mailing list is hosted at FreeLists. Subscribe on that page. You may then send mail to levitationFUNKY-DONUT-SYMBOLfreelists.org.

Scytale

import.py was written by Tim “Scytale” Weber. Since it was announced, the amount of mail in his inbox has increased quite a bit. Therefore, please do not contact him personally if you don’t have to; use IRC or the mailing list instead.
