How can TensorBoard plot more evaluations than just the last step?

2023-09-03 09:03:52 Author: 深情只是我担不起的负担

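I am trying to plot a continuous chart to evaluate my model. TensorBoard (v2.4.1) successfully plots the different losses at every step. However, it only plots the last evaluation step, so my evaluation curves contain a single point.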

Here is my TensorBoard view (image: TensorBoard shows only the last step's evaluation).

I run TensorBoard with the following command:

tensorboard --logdir=models/my_model

I run the evaluation with the following command:

python model_main_tf2.py --model_dir=models/my_model --pipeline_config_path=models/my_model/pipeline.config --checkpoint_dir=models/my_model --run_once=True


Here is my pipeline config file:

model {
  ssd {
    num_classes: 1
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 768
        max_dimension: 768
        pad_to_max_dimension: true
      }
    }
    feature_extractor {
      type: "ssd_efficientnet-b2_bifpn_keras"
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.9999998989515007e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.029999999329447746
          }
        }
        activation: SWISH
        batch_norm {
          decay: 0.9900000095367432
          scale: true
          epsilon: 0.0010000000474974513
        }
        force_use_bias: true
      }
      bifpn {
        min_level: 3
        max_level: 7
        num_iterations: 5
        num_filters: 112
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 1.0
        x_scale: 1.0
        height_scale: 1.0
        width_scale: 1.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
        use_matmul_gather: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      weight_shared_convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.9999998989515007e-05
            }
          }
          initializer {
            random_normal_initializer {
              mean: 0.0
              stddev: 0.009999999776482582
            }
          }
          activation: SWISH
          batch_norm {
            decay: 0.9900000095367432
            scale: true
            epsilon: 0.0010000000474974513
          }
          force_use_bias: true
        }
        depth: 112
        num_layers_before_predictor: 3
        kernel_size: 3
        class_prediction_bias_init: -4.599999904632568
        use_depthwise: true
      }
    }
    anchor_generator {
      multiscale_anchor_generator {
        min_level: 3
        max_level: 7
        anchor_scale: 4.0
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        scales_per_octave: 3
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993922529e-09
        iou_threshold: 0.5
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid_focal {
          gamma: 1.5
          alpha: 0.25
        }
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    encode_background_as_zeros: true
    normalize_loc_loss_by_codesize: true
    inplace_batchnorm_update: true
    freeze_batchnorm: false
    add_background_class: false
  }
}
train_config {
  batch_size: 16
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_scale_crop_and_pad_to_square {
      output_size: 768
      scale_min: 0.10000000149011612
      scale_max: 2.0
    }
  }
  sync_replicas: true
  optimizer {
    momentum_optimizer {
      learning_rate {
        cosine_decay_learning_rate {
          learning_rate_base: 0.07999999821186066
          total_steps: 100000
          warmup_learning_rate: 0.0010000000474974513
          warmup_steps: 2500
        }
      }
      momentum_optimizer_value: 0.8999999761581421
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-models/efficientdet_d2_coco17_tpu-32/checkpoint/ckpt-0"
  num_steps: 200000
  startup_delay_steps: 0.0
  replicas_to_aggregate: 8
  max_number_of_boxes: 100
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  use_bfloat16: false
  fine_tune_checkpoint_version: V2
}
train_input_reader: {
  label_map_path: "annotations/label_img_train.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  num_visualizations: 30
  save_graph: true
  batch_size: 1;
}

eval_input_reader: {
  label_map_path: "annotations/label_img_test.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}

Do you have a solution for this?

Recommended answer

02.2022

TL;DR:

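TensorBoard plots only one validation point because the validation process only considers the latest checkpoint. If the checkpoints are fed to the validation process only after training has finished, it evaluates just the last one.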

Long answer:

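The validation process should run at the same time as the training process (see the code below). This is usually not possible in notebook-like environments such as Colab, because the script executes step by step (one cell at a time and, within each cell, one command at a time).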

def eval_continuously(
    pipeline_config_path,
    config_override=None,
    train_steps=None,
    sample_1_of_n_eval_examples=1,
    sample_1_of_n_eval_on_train_examples=1,
    use_tpu=False,
    override_eval_num_epochs=True,
    postprocess_on_cpu=False,
    model_dir=None,
    checkpoint_dir=None,
    wait_interval=180,
    timeout=3600,
    eval_index=0,
    save_final_config=False,
    **kwargs):
  """Run continuous evaluation of a detection model eagerly.
  This method builds the model, and continously restores it from the most
  recent training checkpoint in the checkpoint directory & evaluates it
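  on the evaluation data.
  ...
  """

However, there may be some tricks to work around this (that is, to get the two processes running simultaneously). In my opinion, though, this may not be the best approach, because training and validation would compete for system resources, especially the GPU, which could slow training down.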
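One such trick can be sketched as follows: start training and evaluation as two separate processes from a small Python launcher. This is only a sketch under assumptions, not a tested recipe. It reuses the script name and paths from the question, relies on `--checkpoint_dir` to put `model_main_tf2.py` into evaluation mode, and leaves out `--run_once=True` on the assumption that the evaluator then keeps waiting for new checkpoints. Setting `CUDA_VISIBLE_DEVICES=-1` for the evaluator is my own addition to keep it off the GPU; the helper names `build_commands` and `launch_both` are hypothetical.

```python
import os
import subprocess

SCRIPT = "model_main_tf2.py"  # same script as in the question


def build_commands(model_dir, pipeline_config_path):
    """Return (train_cmd, eval_cmd) argument lists for model_main_tf2.py."""
    common = [
        "python", SCRIPT,
        f"--model_dir={model_dir}",
        f"--pipeline_config_path={pipeline_config_path}",
    ]
    train_cmd = list(common)
    # Passing --checkpoint_dir switches the script into evaluation mode.
    eval_cmd = common + [f"--checkpoint_dir={model_dir}"]
    return train_cmd, eval_cmd


def launch_both(model_dir, pipeline_config_path):
    """Start the trainer and the evaluator side by side and return both handles."""
    train_cmd, eval_cmd = build_commands(model_dir, pipeline_config_path)
    # Assumption: hiding all GPUs from the evaluator forces it onto the CPU,
    # so it does not compete with training for GPU memory.
    eval_env = dict(os.environ, CUDA_VISIBLE_DEVICES="-1")
    trainer = subprocess.Popen(train_cmd)
    evaluator = subprocess.Popen(eval_cmd, env=eval_env)
    return trainer, evaluator
```

Calling `launch_both("models/my_model", "models/my_model/pipeline.config")` would start both processes; evaluating on the CPU trades GPU contention for slower evaluation, which can make skipped checkpoints more likely.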

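The ideal solution is to run training on one machine, save the checkpoints to a shared disk (for example, Google Drive), and have a second machine run validation, picking up those checkpoints. Both machines should run at the same time.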
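On the evaluating machine, finding out which checkpoints are currently on the shared disk could look like the sketch below. It assumes the usual TF2 checkpoint layout, in which each checkpoint `ckpt-<N>` has a `ckpt-<N>.index` file next to its data shards; the helper name `list_checkpoint_prefixes` is hypothetical and not part of the Object Detection API.

```python
import os
import re

# Assumption: checkpoints are named ckpt-<N>, each with a ckpt-<N>.index file.
_INDEX_RE = re.compile(r"^ckpt-(\d+)\.index$")


def list_checkpoint_prefixes(checkpoint_dir):
    """Return the checkpoint prefixes in checkpoint_dir, sorted by number."""
    found = []
    for name in os.listdir(checkpoint_dir):
        match = _INDEX_RE.match(name)
        if match:
            prefix = os.path.join(checkpoint_dir, name[:-len(".index")])
            found.append((int(match.group(1)), prefix))
    # Sort numerically so that ckpt-10 comes after ckpt-2.
    return [prefix for _, prefix in sorted(found)]
```

Only checkpoints that still exist on disk are returned, so if the training job prunes older checkpoints, those cannot be recovered this way.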

However, a challenge remains. In the function eval_continuously(), look at this line:

for latest_checkpoint in tf.train.checkpoints_iterator( # Here!
      checkpoint_dir, timeout=timeout, min_interval_secs=wait_interval):
    ckpt = tf.compat.v2.train.Checkpoint(
        step=global_step, model=detection_model, optimizer=optimizer)

The docstring of checkpoints_iterator() reads:

def checkpoints_iterator(checkpoint_dir,
                         min_interval_secs=0,
                         timeout=None,
                         timeout_fn=None):
  """Continuously yield new checkpoint files as they appear.
  The iterator only checks for new checkpoints when control flow has been
  reverted to it. This means it can miss checkpoints if your code takes longer
  to run between iterations than `min_interval_secs` or the interval at which
  new checkpoints are written.
  ...
  """

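In other words, the function may miss some checkpoints. One scenario I can imagine: a validation run takes too long (for example, because the dataset is large). Meanwhile, more new checkpoint files are being written. When the function checks for checkpoints again, it only considers the latest one, which means validation skips every checkpoint written before it (for evidence, see here). As a result, the data used to plot validation will be less fine-grained.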
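The effect can be illustrated with a toy simulation. This is a simplified model of my own, not the actual `checkpoints_iterator` code: it assumes checkpoints appear at a fixed interval, every evaluation takes the same amount of time, and the evaluator always picks the newest checkpoint available when it becomes free. The function name `evaluated_checkpoints` is hypothetical.

```python
def evaluated_checkpoints(num_checkpoints, write_interval, eval_duration):
    """Return the 1-based indices of checkpoints a latest-only evaluator sees.

    Checkpoint k becomes available at time k * write_interval (integer units).
    """
    evaluated = []
    last_seen = 0
    now = write_interval  # the evaluator starts when the first checkpoint exists
    while last_seen < num_checkpoints:
        # Newest checkpoint written so far.
        latest = min(now // write_interval, num_checkpoints)
        if latest > last_seen:
            evaluated.append(latest)
            last_seen = latest
            now += eval_duration  # busy evaluating; newer checkpoints pile up
        else:
            # Nothing new yet: wait until the next checkpoint is written.
            now = (last_seen + 1) * write_interval
    return evaluated
```

With a checkpoint every 10 time units and an evaluation that takes 25, `evaluated_checkpoints(6, 10, 25)` returns `[1, 3, 6]`, so half of the checkpoints never get a point on the validation curve. When evaluation is faster than checkpoint writing, for example `evaluated_checkpoints(6, 10, 5)`, all six are evaluated.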