Huawei Geek Week: Ascend Wanli Model Champion Challenge - Getting VQVAE to Run (Part 2)
現(xiàn)在的進(jìn)展問(wèn)題:
1?依瞳系統(tǒng)暫停之后,再開(kāi),那些文件又要重新打一遍補(bǔ)丁
2 issue3還沒(méi)有解決,要持續(xù)關(guān)注這個(gè)issue?https://gitee.com/ascend/modelzoo/issues/I28YYG
How Issue 3 was resolved
經(jīng)排查,QueueDequeueMany輸出shape與TF不一致系上游算子RandomShuffleQueue中shapes屬性未向下傳遞導(dǎo)致
已聯(lián)系負(fù)責(zé)該算子的開(kāi)發(fā)人員進(jìn)行修復(fù)
問(wèn)已經(jīng)解決,請(qǐng)?zhí)鎿Q附件中的文件至如下目錄,替換前請(qǐng)先備份。
/home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so
這樣issue3已經(jīng)解決,又打一個(gè)補(bǔ)丁文件。
Issue 4: error report (AOE)
With the issue 3 patch applied, the error persisted. Enough time had passed, and the error carried no distinctive signature, so I could not tell from the log whether my earlier bug had actually been fixed. Forgive me for being lazy here: I did not compare the error messages carefully.
I then found that the user aoe had already filed an issue. After applying the issue 3 patch, our error messages were identical, so I kept watching that issue and its resolution; it effectively became issue 4:
https://gitee.com/ascend/modelzoo/issues/I2A7SC
The response said to follow the "enabling mixed computing in sess.run mode" section at the link below, mark tf.train.shuffle_batch and tf.train.string_input_producer as not sunk to the device (i.e. executed on the host), and then see whether the network runs:
https://support.huaweicloud.com/mprtg-A800_9000_9010/atlasprtg_13_0033.html
說(shuō)用下混合計(jì)算,看到文檔里說(shuō)混合計(jì)算模式下,iterations_per_loop必須為1。不過(guò)我在代碼里沒(méi)有找到這個(gè)關(guān)鍵字,那是否意味著我不用考慮iterations_per_loop的實(shí)際取值呢?(后來(lái)通過(guò)溝通知道,混合模式iterations_per_loop就已經(jīng)設(shè)為1了)
用戶還可通過(guò)without_npu_compile_scope自行配置不下沉的算子。
So, following the instructions, I modified line 82 of the code:

# changed: keep this op on the host (not sunk to the device)
with npu_scope.without_npu_compile_scope():
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs)
因?yàn)锳OE的問(wèn)題解決了,他的issue關(guān)閉,但是我的問(wèn)題還沒(méi)解決,所以我報(bào)了自己的issue4:
Issue4報(bào)錯(cuò)
提交issue:
https://gitee.com/ascend/modelzoo/issues/I2AMHH
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.FixedLengthRecordDataset`.
WARNING:tensorflow:From cifar10.py:305: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
2020-12-23 22:22:55.165648: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2020-12-23 22:22:55.166302: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_5 success. [0 ms]
2020-12-23 22:22:55.166352: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2020-12-23 22:22:55.166546: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True
2020-12-23 22:22:55.166574: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2020-12-23 22:22:55.166804: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_9 begin.
2020-12-23 22:22:55.166829: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True
2020-12-23 22:22:55.166876: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2020-12-23 22:22:55.167661: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0
2020-12-23 22:22:55.167710: I tf_adapter/util/npu_ops_identifier.cc:67] [MIX] Parsing json from /home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/framework/built-in/tensorflow/npu_supported_ops.json
2020-12-23 22:22:55.169692: I tf_adapter/util/npu_ops_identifier.cc:69] 690 ops parsed
2020-12-23 22:22:55.170185: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [2 ms]
2020-12-23 22:22:55.176442: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2020-12-23 22:22:55.176485: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp9_0 minGroupSize: 1
2020-12-23 22:22:55.176643: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_9 markForPartition success.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cifar10.py", line 517, in
extract_z(**config)
File "cifar10.py", line 330, in extract_z
sess.run(init_op)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list
2020-12-23 22:22:56.194723: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin
2020-12-23 22:22:56.194890: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.
2020-12-23 22:22:56.194990: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end
后來(lái)無(wú)法復(fù)現(xiàn),就關(guān)閉了。但是今天又復(fù)現(xiàn)了。
發(fā)現(xiàn)原來(lái)依瞳系統(tǒng)下面,python和python3.7指向竟然不是同一個(gè)。
model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which python
/usr/bin/python
model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which python3.7
/usr/local/bin/python3.7
model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which pip
/usr/local/bin/pip
因此應(yīng)該用usr/local/bin這個(gè)目錄下的python,也就是python3.7
I checked the tf.train.batch call sites, in particular the allow_smaller_final_batch setting. There are three occurrences in total, and the only one that actually takes effect had already been changed to False:
images, labels = tf.train.batch(
    [image, label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)
I changed the code back to the non-AOE approach, because AOE's workaround did not comply with the competition requirements.
I then marked the other two tf.train.batch calls as not sunk to the device. In mixed computing, "not sunk" simply means wrapping the statement in:
with npu_scope.without_npu_compile_scope():
After so many modifications the program had been mangled beyond recognition, so I re-tested the CPU code and found that even the CIFAR-10 CPU version no longer passed (this is what makes VQVAE hard: change one tiny thing and the error changes, and sometimes it mysteriously breaks even when nothing seems to have been touched).
It took a great deal of effort to get the CPU code running again. I split the CPU version into its own file, cifar_base.py, and then debugged the NPU code against it, so that with all NPU-specific code disabled, the NPU program would also run.
Issue 5: error report, shapes not aligned
https://gitee.com/ascend/modelzoo/issues/I2AVI5
應(yīng)該是data batch那里沒(méi)有把最后一段丟棄的緣故,
images, labels = tf.train.batch(
    [image, label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)
Adding the highlighted allow_smaller_final_batch=False fixed it.
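The effect of that flag can be seen with a bit of plain Python (the numbers below are illustrative, not the actual CIFAR-10 counts): a queue of N samples read with batch size B emits a final batch of N mod B samples unless the remainder is dropped, and an NPU graph compiled for a fixed batch shape cannot accept that smaller batch.

```python
def batch_plan(num_samples: int, batch_size: int, drop_remainder: bool):
    """Return the list of batch sizes a queue-based reader would emit."""
    full, rem = divmod(num_samples, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_remainder:
        sizes.append(rem)  # a smaller final batch breaks a fixed-shape graph
    return sizes

# 10 samples, batch size 4: keeping the remainder yields a shape-2 batch,
# which a graph compiled for shape (4, 32, 32, 3) cannot accept.
print(batch_plan(10, 4, drop_remainder=False))  # [4, 4, 2]
print(batch_plan(10, 4, drop_remainder=True))   # [4, 4]
```

allow_smaller_final_batch=False in tf.train.batch plays the same role that drop_remainder=True plays in the newer tf.data.Dataset.batch API.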
報(bào)issue5.1
https://gitee.com/ascend/modelzoo/issues/I2B2US
The response asked: the next time a DEVMM error appears, please run dmesg and capture the kernel log to help locate the problem. Thanks!
[ERROR] DEVMM(25538,python3.7):2020-12-27-12:52:10.065.876 [hardware/build/../dev_platform/devmm/devmm/devmm_svm.c:268][devmm_copy_ioctl 268]
但是這個(gè)問(wèn)題并不容易復(fù)現(xiàn),在我的系統(tǒng)里偶爾能復(fù)現(xiàn),在研發(fā)那塊復(fù)現(xiàn)也很困難。
結(jié)果元旦后第一個(gè)工作日:新年新氣象,今天略微修改了下代碼,竟然跑通了,我都很驚訝。
數(shù)據(jù)讀取部分用了混合計(jì)算不下沉,原則上沒(méi)有修改骨干代碼?,但是元旦那天還不行,今天稍微改了下代碼,就跑通了?。
這個(gè)issue的問(wèn)題解決了,關(guān)閉。
上面的記錄文字很短,其實(shí)這個(gè)issue花費(fèi)的時(shí)間非常多,從2020年的年尾,一直到2021年的年初,兩頭占著算兩年時(shí)間,中間代碼改的面目全非,bug的樣子也是日新月異,可以說(shuō)最后成功的喜悅有多大,中間的情緒低落就有多深。
還沒(méi)有完成的issue6?報(bào)錯(cuò)
這回的報(bào)錯(cuò)沒(méi)有提交issue6?。
報(bào)錯(cuò)信息:
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From cn.py:346: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
2021-01-04 16:05:37.573107: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost
2021-01-04 16:05:37.574181: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_15 success. [0 ms]
2021-01-04 16:05:37.574252: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.
2021-01-04 16:05:37.574439: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True
2021-01-04 16:05:37.574461: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1
2021-01-04 16:05:37.574658: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_29 begin.
2021-01-04 16:05:37.574679: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True
2021-01-04 16:05:37.574689: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1
2021-01-04 16:05:37.575437: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0
2021-01-04 16:05:37.575952: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]
2021-01-04 16:05:37.582660: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1
2021-01-04 16:05:37.582750: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp29_0 minGroupSize: 1
2021-01-04 16:05:37.583336: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_29 markForPartition success.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cn.py", line 558, in
extract_z(**config)
File "cn.py", line 371, in extract_z
sess.run(init_op)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list
2021-01-04 16:05:38.559795: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin
2021-01-04 16:05:38.560022: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.
2021-01-04 16:05:38.560042: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end
Given this warning, should the code be changed?
WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.
Instructions for updating:
Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).
現(xiàn)在報(bào)錯(cuò)信息為:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cn.py", line 559, in
extract_z(**config)
File "cn.py", line 372, in extract_z
sess.run(init_op)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list
Some searching suggested this might be caused by incorrect initialization, so I tried adding this line:
sess.graph.finalize()
Still the same error.
Main code:
init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
sess = tf.Session(config=config)
# sess.graph.finalize()
sess.run(init_op)
print("=" * 1000, "run sess.run(init_op) OK!")
summary_writer = tf.summary.FileWriter(LOG_DIR, sess.graph)
# logging.warning("dch summary_writer")
summary_writer.add_summary(config_summary.eval(session=sess))
# logging.warning("dch summary_writer.add")
extract_z code:
with npu_scope.without_npu_compile_scope():
    images, labels = tf.train.batch(
        [image, label],
        batch_size=BATCH_SIZE,
        num_threads=1,
        capacity=BATCH_SIZE,
        allow_smaller_final_batch=False)
# <<<<<<<
# images = images.batch(batch_size, drop_remainder=True)
# >>>>>>> MODEL
with tf.variable_scope('net'):
    with tf.variable_scope('params') as params:
        pass
x_ph = tf.placeholder(tf.float32, [BATCH_SIZE, 32, 32, 3])
net = VQVAE(None, None, BETA, x_ph, K, D, _cifar10_arch, params, False)

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())
# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True  # test operator sinking
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
sess = tf.Session(config=config)
logger.warn('warn sess = tf.Session(config=config)')
# sess = tf.Session()
sess.graph.finalize()
sess.run(init_op)
logger.warn('warn sess.run(init_op)')
What finally made it pass was removing the num_epochs=1 argument from this call:

# image,label = get_image(num_epochs=1)
image, label = get_image()

This may not be the final fix, but it will do for now.
Issue 7: error report
It occurs in the train_prior part: config['TRAIN_NUM'] = 8  # errors out from 9 onward
Filed issue: https://gitee.com/ascend/modelzoo/issues/I2BUME/
Error message:
2021-01-04 20:30:06.952771: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp75_0[ 6456228us]
50%|██████████████████████████████████████████████                      | 9/18 [01:36<01:36, 10.67s/it]
Traceback (most recent call last):
File "cifar10.py", line 531, in
train_prior(config=config,**config)
File "cifar10.py", line 476, in train_prior
sess.run(sample_summary_op,feed_dict={sample_images:sampled_ims}),it)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run
(np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (20, 32, 32, 3) for Tensor 'misc/Placeholder:0', which has shape '(1, 32, 32, 3)'
這個(gè)問(wèn)題現(xiàn)在還沒(méi)有解決,完全不知道問(wèn)題出在哪里。?大約跟數(shù)據(jù)的喂入有關(guān)系,但是我目前解決不了,只能先設(shè)為8:config['TRAIN_NUM'] = 8保證整個(gè)程序能跑通,這個(gè)issue先留著吧。
PR submission: the final sprint
After an arduous struggle, the light finally appeared: the whole model now runs on the Ascend system and essentially meets the competition requirements. What remained was fine-tuning.
Review feedback from the judges:
1. The program marks tf.train.string_input_producer as not sunk to the device. Remove that and enable only mixed computing.
2. VQ-VAE network issue: the data-preprocessing stage is controlled by the loop below, which ends by catching the exception raised when the data runs out; on Ascend hardware this currently core-dumps. Please change it to a different control flow:
while not coord.should_stop():
    x, y = sess.run([images, labels])
    k = sess.run(net.k, feed_dict={x_ph: x})
    ks.append(k)
    ys.append(y)
    print('.', end='', flush=True)
except tf.errors.OutOfRangeError:
Final VQVAE PR fixes
1. Enable only mixed computing for the whole program and remove every per-op not-sunk setting. (This is the intended usage: the system sinks everything it supports to the device, and unsupported ops stay on the host by default, with no manual configuration needed.)
2. Replace the while loop with a for loop:
for step in tqdm(xrange(TRAIN_NUM), dynamic_ncols=True):
    x, y = sess.run([images, labels])
    k = sess.run(net.k, feed_dict={x_ph: x})
    ks.append(k)
    ys.append(y)

And set the number of loop steps:
config['TRAIN_NUM'] = 24
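The control-flow change can be illustrated in plain Python: instead of draining the input until it raises an end-of-data exception, iterate a fixed number of steps (the function names are mine and the step count is illustrative):

```python
from itertools import islice

def drain_until_exhausted(source):
    """Stop-driven loop: the shape of the original
    `while not coord.should_stop()` + OutOfRangeError pattern."""
    out = []
    for item in source:  # ends only when the source itself ends
        out.append(item)
    return out

def run_fixed_steps(source, train_num):
    """Bounded loop: the replacement pattern, matching
    `for step in tqdm(xrange(TRAIN_NUM))`."""
    return list(islice(source, train_num))

data = iter(range(100))
print(run_fixed_steps(data, 24))  # exactly TRAIN_NUM items, no exception raised
```

The bounded form never relies on an exception for control flow, which is what the reviewers asked to avoid on Ascend hardware.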
After checking with Shaofang's team: the second part had passed precisely because the get_image argument was removed, and since the step count is now fixed anyway, this should not affect the overall result.
Before: # image,label = get_image(num_epochs=1)
After: image, label = get_image()
Then I submitted the PR, and it finally passed review! Hooray! I was thrilled. The result itself matters less than the process of hitting and solving problems along the way; then again, without a result this write-up would have no reason to exist, the effort might have been wasted, and I would have learned less and remembered less of it.
Summary: migrating the TensorFlow VQVAE model to Ascend
The competition went through registration, model selection, model migration, debugging, and PR submission, as recounted above. What a journey!
The migration contest was an excellent chance to learn and practice. I knew nothing about TensorFlow beforehand; after this competition, whatever my actual level of understanding, I have at least read the code many times over and picked up some of the flow of a TF program. My only prior contact with the Ascend system had been through ModelArts notebooks and training jobs; this was the first time I could freely install software and fully control the system, in the Yitong environment. During debugging I worked directly with Huawei's developers, whose prompt and accurate troubleshooting left me with real confidence in the Ascend system and the MindSpore AI framework!
The Silver stage of the model competition is starting soon, so get ready to sign up!