Daniel Voigt Godoy - Deep Learning with PyTorch Step-by-Step: A Beginner's Guide
Output

TrainingArguments(output_dir=tmp_trainer, overwrite_output_dir=False,
do_train=False, do_eval=None, do_predict=False,
evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False,
per_device_train_batch_size=8, per_device_eval_batch_size=8,
gradient_accumulation_steps=1, eval_accumulation_steps=None,
learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9,
adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0,
num_train_epochs=3.0, max_steps=-1,
lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.0,
warmup_steps=0, logging_dir=runs/Apr21_20-33-20_MONSTER,
logging_strategy=IntervalStrategy.STEPS, logging_first_step=False,
logging_steps=500, save_strategy=IntervalStrategy.STEPS,
save_steps=500, save_total_limit=None, no_cuda=False, seed=42,
fp16=False, fp16_opt_level=O1, fp16_backend=auto,
fp16_full_eval=False, local_rank=-1, tpu_num_cores=None,
tpu_metrics_debug=False, debug=False, dataloader_drop_last=False,
eval_steps=500, dataloader_num_workers=0, past_index=-1,
run_name=tmp_trainer, disable_tqdm=False, remove_unused_columns=True,
label_names=None, load_best_model_at_end=False,
metric_for_best_model=None, greater_is_better=None,
ignore_data_skip=False, sharded_ddp=[], deepspeed=None,
label_smoothing_factor=0.0, adafactor=False, group_by_length=False,
length_column_name=length, report_to=['tensorboard'],
ddp_find_unused_parameters=None, dataloader_pin_memory=True,
skip_memory_metrics=False, _n_gpu=1, mp_parameters=)

The Trainer creates an instance of TrainingArguments by itself, and the values above are the arguments' default values. There is the learning_rate=5e-05, the num_train_epochs=3.0, and many, many others. The optimizer used, even though it is not listed above, is AdamW, a variation of Adam.

We can create an instance of TrainingArguments ourselves to get at least a bit of control over the training process. The only required argument is the output_dir, but we'll specify some other arguments as well:
Training Arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='output',
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    evaluation_strategy='steps',
    eval_steps=300,
    logging_steps=300,
    gradient_accumulation_steps=8,
)

"Batch size ONE?! You gotta be kidding me!"

Well, I would, if it were not for the gradient_accumulation_steps argument. That's how we can make the mini-batch size larger even if we're using a low-end GPU that is capable of handling only one data point at a time.

The Trainer can accumulate the gradients computed at every training step (which is taking only one data point), and, after eight steps, it uses the accumulated gradients to update the parameters. For all intents and purposes, it is as if the mini-batch had size eight. Awesome, right? (There is a rough sketch of this idea in plain PyTorch at the end of this section.)

Moreover, let's set the logging_steps to three hundred, so it prints the training losses every three hundred mini-batches (and it counts the mini-batches as having size eight due to the gradient accumulation).

"What about validation losses?"

The evaluation_strategy argument allows you to run an evaluation after every eval_steps steps (if set to steps, like in the example above) or after every epoch (if set to epoch).

"Can I get it to print accuracy or other metrics too?"

Sure, you can! But, first, you need to define a function that takes an instance of EvalPrediction (returned by the internal validation loop), computes the desired metrics, and returns a dictionary:
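For instance, here is a minimal sketch of such a function, computing plain accuracy for a classification task. The name compute_metrics and the use of NumPy's argmax are illustrative choices here, not necessarily the book's own listing:

Computing Metrics (sketch)

import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is an EvalPrediction, a named tuple holding
    # .predictions (the model's logits) and .label_ids (the true
    # labels), both as NumPy arrays
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    # pick the class with the highest logit for each data point
    preds = np.argmax(logits, axis=-1)
    # every key in the returned dictionary shows up in the evaluation logs
    return {'accuracy': (preds == labels).mean()}

The function is then handed over to the Trainer through its compute_metrics argument, together with the model, the training arguments, and the datasets.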
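Going back to the gradient accumulation trick, the sketch below shows the general idea in plain PyTorch: backward() is called at every step, but the optimizer's step() only once every accumulation_steps steps. The whole setup (model, loss, optimizer, and data) is made up for illustration and is not what the Trainer actually runs internally:

Gradient Accumulation (sketch)

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# toy setup: a linear model, MSE loss, SGD, and a loader that
# yields a single data point at a time
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)),
                    batch_size=1)

accumulation_steps = 8  # plays the role of gradient_accumulation_steps=8

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # scaling the loss makes the accumulated gradients behave like the
    # average over a "virtual" mini-batch of eight data points
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # parameters are updated once every eight steps
        optimizer.zero_grad()  # gradients are zeroed to start accumulating again

The net effect is the same parameter update we would get from a mini-batch of size eight, while never holding more than one data point's activations in memory at a time.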