The Joshua Pipeline

This page describes the Joshua pipeline script, which manages the complexity of training and evaluating machine translation systems. The pipeline eases the pain of two related tasks in statistical machine translation (SMT) research:

  1. Training SMT systems involves a complicated process of interacting steps that are time-consuming and prone to failure.

  2. Developing and testing new techniques requires varying parameters at different points in the pipeline. Ideally, earlier results (which are often expensive to compute) should not need to be recomputed.

To facilitate these tasks, the pipeline script runs the complete SMT pipeline, from corpus normalization and tokenization through model building, tuning, test-set decoding, and evaluation.

The Joshua pipeline script is designed in the spirit of Moses’ train-model.pl, and shares many of its features. It is not as extensive, however, as Moses’ Experiment Management System.

Installation

The pipeline has no required external dependencies. However, it has support for a number of external packages, some of which are included with Joshua.

Make sure that the environment variable $JOSHUA is defined, and you should be all set.
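
For example, in a bash shell you might add something like the following to your startup file (the installation path shown is illustrative):

    # point $JOSHUA at the root of your Joshua checkout or release
    export JOSHUA=/path/to/joshua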

A basic pipeline run

The pipeline takes a set of inputs (training, tuning, and test data), and creates a set of intermediate files in the run directory. By default, the run directory is the current directory, but it can be changed with the --rundir parameter.

For this quick start, we will be working with the example that can be found in $JOSHUA/examples/pipeline. This example contains 1,000 sentences of Urdu-English training data (the full dataset is available as part of the Indian languages parallel corpora), along with 100-sentence tuning and test sets, each with four references.

Running the pipeline requires two main steps: data preparation and invocation.

  1. Prepare your data. The pipeline script needs to be told where to find the raw training, tuning, and test data. A good convention is to place these files in an input/ subdirectory of your run’s working directory (NOTE: do not use data/, since a directory of that name is created and used by the pipeline itself). The expected format (for each of training, tuning, and test) is a pair of files that share a common path prefix and are distinguished by their extension:

    input/
          train.SOURCE
          train.TARGET
          tune.SOURCE
          tune.TARGET
          test.SOURCE
          test.TARGET
    

    These files should be parallel at the sentence level (with one sentence per line), should be in UTF-8, and should be untokenized (tokenization occurs in the pipeline). SOURCE and TARGET denote variables that should be replaced with the actual source and target language abbreviations (e.g., “ur” and “en”).

  2. Run the pipeline. The following is the minimal invocation to run the complete pipeline:

    $JOSHUA/scripts/training/pipeline.pl  \
      --corpus input/train                \
      --tune input/tune                   \
      --test input/test                   \
      --source SOURCE                     \
      --target TARGET
    

    The --corpus, --tune, and --test flags define file prefixes that are concatenated with the language extensions given by --target and --source (with a “.” in between). Note the correspondences with the files defined in the first step above. The prefixes can be either absolute or relative pathnames. This particular invocation assumes that a subdirectory input/ exists in the current directory, that you are translating from a language identified by the “ur” extension to a language identified by the “en” extension, that the training data can be found at input/train.en and input/train.ur, and so on.

Assuming no problems arise, this command will run the complete pipeline in about 20 minutes, producing BLEU scores at the end. As it runs, you will see output that looks like the following:

[train-copy-en] rebuilding...
  dep=/Users/post/code/joshua/test/pipeline/input/train.en 
  dep=data/train/train.en.gz [NOT FOUND]
  cmd=cat /Users/post/code/joshua/test/pipeline/input/train.en | gzip -9n > data/train/train.en.gz
  took 0 seconds (0s)
[train-copy-ur] rebuilding...
  dep=/Users/post/code/joshua/test/pipeline/input/train.ur 
  dep=data/train/train.ur.gz [NOT FOUND]
  cmd=cat /Users/post/code/joshua/test/pipeline/input/train.ur | gzip -9n > data/train/train.ur.gz
  took 0 seconds (0s)
...

And in the current directory, you will see the following files (among other intermediate files generated by the individual sub-steps).

data/
    train/
        corpus.ur
        corpus.en
        thrax-input-file
    tune/
        tune.tok.lc.ur
        tune.tok.lc.en
        grammar.filtered.gz
        grammar.glue
    test/
        test.tok.lc.ur
        test.tok.lc.en
        grammar.filtered.gz
        grammar.glue
alignments/
    0/
        [berkeley aligner output files]
    training.align
thrax-hiero.conf
thrax.log
grammar.gz
lm.gz
tune/
    1/
        decoder_command
        joshua.config
        params.txt
        joshua.log
        mert.log
        joshua.config.ZMERT.final
    final-bleu

These files will be described in more detail in subsequent sections of this tutorial.

Another useful option is the --rundir DIR flag, which chdir()s to the specified directory before running the pipeline. By default the run directory is the current directory. Changing it can be useful for organizing related pipeline runs. Relative paths specified to other flags (e.g., to --corpus or --lmfile) are relative to the directory the pipeline was called from, not the rundir itself (unless they happen to be the same, of course).
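
For example, a sketch that keeps a run in its own directory might look like the following (the directory name runs/baseline is illustrative); note that the input/ prefixes are resolved relative to the directory you invoke the script from, not relative to runs/baseline:

    $JOSHUA/scripts/training/pipeline.pl  \
      --rundir runs/baseline              \
      --corpus input/train                \
      --tune input/tune                   \
      --test input/test                   \
      --source ur                         \
      --target en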

The complete pipeline comprises many tens of small steps, which can be grouped together into a set of traditional pipeline tasks:

  1. Data preparation
  2. Alignment
  3. Parsing
  4. Grammar extraction
  5. Language model building
  6. Tuning
  7. Testing

These steps are discussed below, after a few intervening sections about high-level details of the pipeline.

Grammar options

Joshua can extract two types of grammars: Hiero-style grammars and SAMT grammars. As described on the file formats page, both of them are encoded into the same file format, but they differ in terms of the richness of their nonterminal sets.

Hiero grammars make use of a single nonterminal, and are extracted by computing phrases from word-based alignments and then subtracting out phrase differences. More detail can be found in Chiang (2007). SAMT grammars make use of a source- or target-side parse tree on the training data, projecting constituent labels down onto the phrasal alignments in a variety of configurations. SAMT grammars are usually many times larger and are much slower to decode with, but sometimes increase the BLEU score. Both grammar formats are extracted with the Thrax software.

By default, the Joshua pipeline extracts a Hiero grammar, but this can be altered with the --type samt flag.
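
For example, a sketch of a SAMT run (the run directory name is illustrative; the other flags are as in the basic invocation above):

    $JOSHUA/scripts/training/pipeline.pl --rundir runs/samt --corpus input/train \
        --tune input/tune --test input/test --source ur --target en --type samt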

Other high-level options

The following command-line arguments control run-time behavior of multiple steps:

Restarting failed runs

If the pipeline dies, you can restart it with the same command you used the first time. If you rerun the pipeline with the exact same invocation as the previous run (or an overlapping configuration – one that causes the same set of behaviors), you will see slightly different output compared to what we saw above:

[train-copy-en] cached, skipping...
[train-copy-ur] cached, skipping...
...

This indicates that the caching module has discovered that the step was already computed and thus did not need to be rerun. This feature is quite useful for restarting pipeline runs that have crashed due to bugs, memory limitations, hardware failures, and the myriad other problems that plague MT researchers across the world.

Often, a command will die because it was parameterized incorrectly; for example, perhaps the decoder ran out of memory. The caching behavior allows you to adjust the offending parameter (e.g., --joshua-mem) and rerun the script. Of course, if you change one of the parameters a step depends on, it will trigger a rerun of that step, which in turn might trigger further downstream reruns.
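
For example, if the decoder ran out of memory, you might rerun the original command with a larger memory setting (the value 8g is illustrative); cached steps are skipped, and steps that depend on the changed parameter are rerun:

    # same invocation as before, plus a larger decoder memory limit
    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --test input/test --source ur --target en --joshua-mem 8g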

Skipping steps, quitting early

You will also find it useful to start the pipeline somewhere other than data preparation (for example, if you have already-processed data and an alignment, and want to begin with building a grammar) or to end it prematurely (if, say, you don’t have a test set and just want to tune a model). This can be accomplished with the --first-step and --last-step flags, each of which takes a (case-insensitive) step name as its argument.
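
For example, a sketch that starts at grammar extraction and stops after tuning; the step names THRAX and TUNE are assumptions here (check the script's usage message for the exact keywords it accepts), and the processed data and alignments from an earlier run are assumed to already be present in the run directory:

    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --source ur --target en --first-step THRAX --last-step TUNE  # step names are assumed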

We now discuss these steps in more detail.

## 1. DATA PREPARATION

Data preparation involves doing the following to each of the training data (--corpus), tuning data (--tune), and testing data (--test). Each of these values is an absolute or relative path prefix. To each of these prefixes, a “.” is appended, followed by each of SOURCE (--source) and TARGET (--target), which are file extensions identifying the languages. The SOURCE and TARGET files must have the same number of lines.

For tuning and test data, multiple references are handled automatically. A single reference will have the format TUNE.TARGET, while multiple references will have the format TUNE.TARGET.NUM, where NUM starts at 0 and increments for as many references as there are.
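
For example, a tuning set with an Urdu source side and four English references would be laid out like this:

    input/
          tune.ur
          tune.en.0
          tune.en.1
          tune.en.2
          tune.en.3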

The following processing steps are applied to each file.

  1. Copying the files into RUNDIR/data/TYPE, where TYPE is one of “train”, “tune”, or “test”. Multiple --corpora files are concatenated in the order they are specified. Multiple --tune and --test flags are not currently allowed.

  2. Normalizing punctuation and text (e.g., removing extra spaces, converting special quotation marks). There are a few language-specific options that depend on the file extension matching the two-letter ISO 639-1 designation.

  3. Tokenizing the data (e.g., separating out punctuation, converting brackets). Again, there are language-specific tokenizations for a few languages (English, German, and Greek).

  4. (Training only) Removing all parallel sentences with more than --maxlen tokens on either side. By default, MAXLEN is 50. To turn this off, specify --maxlen 0.

  5. Lowercasing.

This creates a series of intermediate files, which are saved (compressed) for posterity. For example, you might see:

data/
    train/
        train.en.gz
        train.tok.en.gz
        train.tok.50.en.gz
        train.tok.50.lc.en
        corpus.en -> train.tok.50.lc.en

The file “corpus.LANG” is a symbolic link to the last file in the chain.

## 2. ALIGNMENT

Alignments are computed between the parallel corpora at RUNDIR/data/train/corpus.{SOURCE,TARGET}. To prevent the alignment tables from getting too big, the parallel corpora are grouped into files of no more than ALIGNER_CHUNK_SIZE blocks (controlled with a parameter below). The last block is folded into the penultimate block if it is too small. These chunked files are all created in the subdirectory RUNDIR/data/train/splits, named corpus.LANG.0, corpus.LANG.1, and so on.
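
For an Urdu-English run, the splits directory will therefore contain files along the following lines (the number of chunks depends on the corpus size and the chunk size):

    data/
        train/
            splits/
                corpus.ur.0
                corpus.en.0
                corpus.ur.1
                corpus.en.1
                ...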

The pipeline parameters affecting alignment are:

When alignment is complete, the alignment file can be found at RUNDIR/alignments/training.align. It is parallel to the training corpora. There are many files in the alignments/ subdirectory that contain the output of intermediate steps.

## 3. PARSING

When SAMT grammars are being built (--type samt), the target side of the training data must be parsed. The pipeline assumes your target side will be English, and will parse it for you using the Berkeley parser, which is included. If English is not your target-side language, the target side of your training data (found at CORPUS.TARGET) must already be parsed in PTB format. The pipeline will notice that it is parsed and will not reparse it.

Parsing is affected by both the --threads N and --jobs N options. The former runs the parser in multithreaded mode, while the latter distributes the runs across a cluster (which requires some configuration, not yet documented). The options are mutually exclusive.
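
For example, to run the included Berkeley parser in multithreaded mode during a SAMT build (the thread count is illustrative):

    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --test input/test --source ur --target en --type samt --threads 8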

Once the parsing is complete, there will be two parsed files:

## 4. THRAX (grammar extraction)

The grammar extraction step takes three pieces of data: (1) the source-language training corpus, (2) the target-language training corpus (parsed, if a SAMT grammar is being extracted), and (3) the alignment file. From these, it computes a synchronous context-free grammar. If you already have a grammar and wish to skip this step, you can do so by passing the grammar with the --grammar GRAMMAR flag.
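
For example, a sketch that reuses a previously extracted grammar instead of running Thrax (the grammar path is illustrative):

    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --test input/test --source ur --target en --grammar /path/to/grammar.gz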

The main variable in grammar extraction is Hadoop. If you have a Hadoop installation, simply ensure that the environment variable $HADOOP is defined, and Thrax will seamlessly use it. If you do not have a Hadoop installation, the pipeline will roll one out for you, running Hadoop in standalone mode (this mode is triggered when $HADOOP is undefined). Theoretically, any grammar extractable on a full Hadoop cluster should be extractable in standalone mode, if you are patient enough; in practice, you probably are not patient enough, and will be limited to smaller datasets. Setting up your own Hadoop cluster is not too difficult a chore; in particular, you may find it helpful to install a pseudo-distributed version of Hadoop. In our experience, this works fine, but you should note the following caveats:

Here are some flags relevant to Hadoop and grammar extraction with Thrax:

When the grammar is extracted, it is compressed and placed at RUNDIR/grammar.gz.
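
To use an existing Hadoop installation rather than standalone mode, it is enough to define $HADOOP before invoking the pipeline (the path is illustrative):

    # if $HADOOP is left undefined, Thrax falls back to standalone mode
    export HADOOP=/opt/hadoop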

## 5. Language model

Before tuning can take place, a language model is needed. A language model is always built from the target side of the training corpus unless --no-corpus-lm is specified. In addition, you can provide other language models (any number of them) with the --lmfile FILE argument. Other arguments are as follows.

A language model built from the target side of the training data is placed at RUNDIR/lm.gz.
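
For example, a sketch that supplies an additional pre-built language model alongside the corpus-derived one (the file name is illustrative):

    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --test input/test --source ur --target en --lmfile /path/to/extra.lm.gz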

Interlude: decoder arguments

Running the decoder is done in both the tuning stage and the testing stage. A critical point is that you have to give the decoder enough memory to run. Joshua can be very memory-intensive, in particular when decoding with large grammars and large language models. The default amount of memory is 3100m, which is likely not enough (especially if you are decoding with a SAMT grammar). You can alter the amount of memory for Joshua using the --joshua-mem MEM argument, where MEM is a Java memory specification (passed to Java's -Xmx flag).

## 6. TUNING

Two optimizers are implemented for Joshua: MERT and PRO (--tuner {mert,pro}). Tuning is run until convergence in the RUNDIR/tune directory. By default, tuning is run just once, but the pipeline supports running the optimizer an arbitrary number of times, owing to recent work pointing out the variance of tuning procedures in machine translation (MERT in particular). This can be activated with --optimizer-runs N. Each run can be found in a directory RUNDIR/tune/N.
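
For example, a sketch that tunes with PRO and repeats the optimization three times:

    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --test input/test --source ur --target en --tuner pro --optimizer-runs 3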

When tuning is finished, each final configuration file can be found at either

RUNDIR/tune/N/joshua.config.ZMERT.final
RUNDIR/tune/N/joshua.config.PRO.final

where N varies from 1..--optimizer-runs.

## 7. Testing

For each of the tuner runs, Joshua takes the tuner output file and decodes the test set. Afterwards, by default, minimum Bayes-risk decoding is run on the 300-best output. This step usually yields about 0.3 - 0.5 BLEU points but is time-consuming, and can be turned off with the --no-mbr flag.
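
For example, to skip the MBR step for a faster turnaround:

    $JOSHUA/scripts/training/pipeline.pl --corpus input/train --tune input/tune \
        --test input/test --source ur --target en --no-mbr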

After decoding the test set with each set of tuned weights, Joshua computes the mean BLEU score, writes it to RUNDIR/test/final-bleu, and cats it. That’s the end of the pipeline!

Joshua also supports decoding further test sets. This is enabled by rerunning the pipeline with a number of arguments:

COMMON USE CASES AND PITFALLS

FEEDBACK

Please email joshua_support@googlegroups.com with problems or suggestions.