

Classifying with BERT ‒ command line

As mentioned in the previous section, it is often not necessary to fine-tune the pretrained model; you can build classifiers directly on top of it. The run_classifier.py script allows you to run either the pretrained or a fine-tuned model against your own data, and it provides input parsers for several popular formats. In our example we will use the Corpus of Linguistic Acceptability (CoLA) [39] format for single-sentence classification and the Microsoft Research Paraphrase Corpus (MRPC) [40] format for sentence pair classification. The format is selected using the --task_name parameter.

For single-sentence classification, your training and validation input should be provided in separate TSV files named train.tsv and dev.tsv respectively, using the following format required by the CoLA parser. Here {TAB} indicates the tab separator character. The "junk" string is just a placeholder and is ignored. The class label must be an integer corresponding to the class of the training or validation record. The ID field is just a running number; the parser prepends train, dev, or test to the ID value, so IDs don't have to be unique across the TSV files:

id {TAB} class-label {TAB} "junk" {TAB} text-of-example
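
To make the format concrete, here are two hypothetical train.tsv rows for an acceptability-style task (the sentences and labels are invented for illustration, with 1 = acceptable and 0 = not acceptable):

1 {TAB} 1 {TAB} "junk" {TAB} She has been reading all afternoon.
2 {TAB} 0 {TAB} "junk" {TAB} She have been reading all afternoon.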

Your test data should go into another TSV file named test.tsv, with the following format. Unlike train.tsv and dev.tsv, the CoLA parser expects test.tsv to begin with the header row shown here:

id {TAB} sentence
1 {TAB} text-of-test-sentence
...

For sentence pair classification, the train.tsv and dev.tsv files required by the MRPC parser should have the following format, where the first field is the integer class label:

class-label {TAB} "junk" {TAB} "junk" {TAB} sentence-1 {TAB} sentence-2

The corresponding format for test.tsv is the same, except that the first field is a running ID rather than a label. Note that the MRPC parser skips the first line of every file, so each of the three files should begin with a header row:

id {TAB} "junk" {TAB} "junk" {TAB} sentence-1 {TAB} sentence-2
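
As an illustration, a hypothetical train.tsv for a paraphrase detection task might look like this (the header text and sentence pairs are invented; the parser simply skips the first line, so the header content does not matter):

quality {TAB} id-1 {TAB} id-2 {TAB} string-1 {TAB} string-2
1 {TAB} "junk" {TAB} "junk" {TAB} The cat sat on the mat. {TAB} A cat was sitting on the mat.
0 {TAB} "junk" {TAB} "junk" {TAB} The cat sat on the mat. {TAB} Stock prices fell sharply on Monday.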

Put these three files into the data/my_data folder. You can then train your classifier on top of the pretrained BERT language model by running the following command at the root of the BERT project. If you prefer to use a fine-tuned version, point --init_checkpoint to the checkpoint files generated by fine-tuning instead. The following command trains the classifier with a maximum sequence length of 128 and a batch size of 8, for 2 epochs, with a learning rate of 2e-5. It writes a file test_results.tsv to the ${TRAINED_CLASSIFIER} folder, containing the predictions of the trained model on the test data, and also writes the checkpoint files for the trained model to the same directory:
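
Here is a sketch of such an invocation, using the standard flags of run_classifier.py from the google-research/bert repository. The ${BERT_BASE_DIR} path to the unpacked pretrained checkpoint is an assumption; set --task_name to cola for single-sentence data or mrpc for sentence pair data:

# Assumed locations; adjust to wherever you unpacked the pretrained model
export BERT_BASE_DIR=./data/uncased_L-12_H-768_A-12
export TRAINED_CLASSIFIER=./data/my_classifier

python run_classifier.py \
  --task_name=cola \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --data_dir=./data/my_data \
  --vocab_file=${BERT_BASE_DIR}/vocab.txt \
  --bert_config_file=${BERT_BASE_DIR}/bert_config.json \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --learning_rate=2e-5 \
  --num_train_epochs=2.0 \
  --output_dir=${TRAINED_CLASSIFIER}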

