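I am trying to plot a continuous curve to evaluate my model. TensorBoard (v2.4.1) successfully plots the different losses at every step. However, it only plots the last evaluation step, so my evaluation curves contain a single point.
Here is my TensorBoard view: [image: Tensorboard show only the last step's evaluation]
I run TensorBoard with the following command: tensorboard --logdir=models/my_model
I run the evaluation with the following command: python model_main_tf2.py --model_dir=models/my_model --pipeline_config_path=models/my_model/pipeline.config --checkpoint_dir=models/my_model --run_once=True
Here is my pipeline config file: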
model {
ssd {
num_classes: 1
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 768
max_dimension: 768
pad_to_max_dimension: true
}
}
feature_extractor {
type: "ssd_efficientnet-b2_bifpn_keras"
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 3.9999998989515007e-05
}
}
initializer {
truncated_normal_initializer {
mean: 0.0
stddev: 0.029999999329447746
}
}
activation: SWISH
batch_norm {
decay: 0.9900000095367432
scale: true
epsilon: 0.0010000000474974513
}
force_use_bias: true
}
bifpn {
min_level: 3
max_level: 7
num_iterations: 5
num_filters: 112
}
}
box_coder {
faster_rcnn_box_coder {
y_scale: 1.0
x_scale: 1.0
height_scale: 1.0
width_scale: 1.0
}
}
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
use_matmul_gather: true
}
}
similarity_calculator {
iou_similarity {
}
}
box_predictor {
weight_shared_convolutional_box_predictor {
conv_hyperparams {
regularizer {
l2_regularizer {
weight: 3.9999998989515007e-05
}
}
initializer {
random_normal_initializer {
mean: 0.0
stddev: 0.009999999776482582
}
}
activation: SWISH
batch_norm {
decay: 0.9900000095367432
scale: true
epsilon: 0.0010000000474974513
}
force_use_bias: true
}
depth: 112
num_layers_before_predictor: 3
kernel_size: 3
class_prediction_bias_init: -4.599999904632568
use_depthwise: true
}
}
anchor_generator {
multiscale_anchor_generator {
min_level: 3
max_level: 7
anchor_scale: 4.0
aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
scales_per_octave: 3
}
}
post_processing {
batch_non_max_suppression {
score_threshold: 9.99999993922529e-09
iou_threshold: 0.5
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID
}
normalize_loss_by_num_matches: true
loss {
localization_loss {
weighted_smooth_l1 {
}
}
classification_loss {
weighted_sigmoid_focal {
gamma: 1.5
alpha: 0.25
}
}
classification_weight: 1.0
localization_weight: 1.0
}
encode_background_as_zeros: true
normalize_loc_loss_by_codesize: true
inplace_batchnorm_update: true
freeze_batchnorm: false
add_background_class: false
}
}
train_config {
batch_size: 16
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
random_scale_crop_and_pad_to_square {
output_size: 768
scale_min: 0.10000000149011612
scale_max: 2.0
}
}
sync_replicas: true
optimizer {
momentum_optimizer {
learning_rate {
cosine_decay_learning_rate {
learning_rate_base: 0.07999999821186066
total_steps: 100000
warmup_learning_rate: 0.0010000000474974513
warmup_steps: 2500
}
}
momentum_optimizer_value: 0.8999999761581421
}
use_moving_average: false
}
fine_tune_checkpoint: "pre-trained-models/efficientdet_d2_coco17_tpu-32/checkpoint/ckpt-0"
num_steps: 200000
startup_delay_steps: 0.0
replicas_to_aggregate: 8
max_number_of_boxes: 100
unpad_groundtruth_tensors: false
fine_tune_checkpoint_type: "detection"
use_bfloat16: false
fine_tune_checkpoint_version: V2
}
train_input_reader: {
label_map_path: "annotations/label_img_train.pbtxt"
tf_record_input_reader {
input_path: "annotations/train.record"
}
}
eval_config: {
metrics_set: "coco_detection_metrics"
use_moving_averages: false
num_visualizations: 30
save_graph: true
batch_size: 1;
}
eval_input_reader: {
label_map_path: "annotations/label_img_test.pbtxt"
shuffle: false
num_epochs: 1
tf_record_input_reader {
input_path: "annotations/test.record"
}
}
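Do you have any idea how to solve this?
TensorBoard plots only one evaluation point because the evaluation process only considers the latest checkpoint. If you finish training first and then point the evaluation process at the checkpoint directory, it evaluates the last checkpoint only. The docstring of eval_continuously() says as much: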
    def eval_continuously(
        pipeline_config_path,
        config_override=None,
        train_steps=None,
        sample_1_of_n_eval_examples=1,
        sample_1_of_n_eval_on_train_examples=1,
        use_tpu=False,
        override_eval_num_epochs=True,
        postprocess_on_cpu=False,
        model_dir=None,
        checkpoint_dir=None,
        wait_interval=180,
        timeout=3600,
        eval_index=0,
        save_final_config=False,
        **kwargs):
      """Run continuous evaluation of a detection model eagerly.

      This method builds the model, and continously restores it from the most
      recent training checkpoint in the checkpoint directory & evaluates it
      on the evaluation data.
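There may be some tricks to work around this, namely running training and evaluation at the same time. In my opinion that is not the best option on a single machine, because training and evaluation compete for system resources, especially the GPU, which can slow training down.
The ideal solution is to run training on one machine, save the checkpoints to a shared disk (e.g. Google Drive), and have a second machine run the evaluation, picking up those checkpoints as they appear. Both machines should run simultaneously.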
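As a rough illustration of how the evaluating machine could see which checkpoints are present on the shared disk, here is a minimal sketch using only the standard library. It assumes checkpoints are stored as ckpt-N.index files (the naming suggested by the ckpt-0 path in the config above); the helper name list_checkpoint_prefixes is hypothetical and not part of the Object Detection API.

```python
import re
from pathlib import Path

# Assumption: each checkpoint is saved with an index file named "ckpt-<N>.index".
_INDEX_PATTERN = re.compile(r"^(ckpt-(\d+))\.index$")


def list_checkpoint_prefixes(checkpoint_dir):
    """Return the checkpoint prefixes found in checkpoint_dir, oldest first.

    Hypothetical helper: it only inspects file names and never loads a model.
    """
    found = []
    for path in Path(checkpoint_dir).iterdir():
        match = _INDEX_PATTERN.match(path.name)
        if match:
            number = int(match.group(2))
            # The prefix is the path without the ".index" suffix.
            found.append((number, str(path.with_name(match.group(1)))))
    # Sort numerically so that ckpt-10 comes after ckpt-2.
    return [prefix for _, prefix in sorted(found)]


if __name__ == "__main__":
    for prefix in list_checkpoint_prefixes("models/my_model"):
        print(prefix)
```

Only checkpoints that are still on disk can be listed this way. Each returned prefix is the kind of path that a tf.train.Checkpoint object's restore() method accepts, but feeding them one by one into the evaluation loop is not shown here and would mean wrapping or modifying eval_continuously().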
A challenge remains, however. Inside eval_continuously(), look at this line:
    for latest_checkpoint in tf.train.checkpoints_iterator(  # Here!
        checkpoint_dir, timeout=timeout, min_interval_secs=wait_interval):
      ckpt = tf.compat.v2.train.Checkpoint(
          step=global_step, model=detection_model, optimizer=optimizer)
The docstring of checkpoints_iterator() explains:
    def checkpoints_iterator(checkpoint_dir,
                             min_interval_secs=0,
                             timeout=None,
                             timeout_fn=None):
      """Continuously yield new checkpoint files as they appear.

      The iterator only checks for new checkpoints when control flow has been
      reverted to it. This means it can miss checkpoints if your code takes longer
      to run between iterations than `min_interval_secs` or the interval at which
      new checkpoints are written.
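In other words, the function can miss checkpoints. One scenario I can imagine: the evaluation takes too long (for example, because the dataset is large), and several new checkpoint files are written in the meantime. When the function checks for checkpoints again, it only considers the latest one, so every checkpoint written before it is skipped! (For proof, see here.) As a result, the evaluation curve will be less fine-grained than the training curve.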
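To make the scenario concrete, here is a small pure-Python simulation of a "latest checkpoint only" evaluator. It is a simplified model, not the TensorFlow implementation: the function name and all parameters are hypothetical, time is measured in arbitrary units, and min_interval is modeled simply as a lower bound on the time between two consecutive evaluations.

```python
def simulate_latest_only_eval(save_every, eval_duration, total_time, min_interval=0):
    """Simulate an evaluator that always picks the newest checkpoint.

    Hypothetical model: a checkpoint is written at save_every, 2 * save_every,
    ... up to total_time. Whenever the evaluator is free, it evaluates the
    newest checkpoint written so far, unless it has already evaluated it.
    Returns (evaluated, missed) as lists of checkpoint write times.
    """
    written = list(range(save_every, total_time + 1, save_every))
    evaluated = []
    now = 0
    while True:
        available = [t for t in written if t <= now]
        newest = available[-1] if available else None
        if newest is not None and (not evaluated or newest != evaluated[-1]):
            evaluated.append(newest)
            # The evaluator is busy for the evaluation itself, and never
            # starts the next one sooner than min_interval later.
            now += max(eval_duration, min_interval)
        else:
            upcoming = [t for t in written if t > now]
            if not upcoming:
                break  # Nothing left to wait for.
            now = upcoming[0]  # Idle until the next checkpoint is written.
    missed = [t for t in written if t not in evaluated]
    return evaluated, missed


if __name__ == "__main__":
    # Checkpoints every 100 time units, each evaluation takes 250.
    evaluated, missed = simulate_latest_only_eval(100, 250, 1000)
    print("evaluated:", evaluated)  # [100, 300, 600, 800, 1000]
    print("missed:", missed)        # [200, 400, 500, 700, 900]
```

In this model, half of the checkpoints never produce an evaluation point when an evaluation takes 2.5 times as long as the checkpoint interval. When the evaluation is faster than the checkpoint interval (and min_interval is not larger than it), nothing is missed.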