Huawei Geek Week, "Ascend Ten Thousand Miles": Getting VQVAE Running in the Model King Challenge (Part 2)

Community contribution · 777 · 2022-05-30

Current open problems:

1. After the Yitong (依瞳) system is suspended and restarted, all the patched files have to be re-patched.

2. Issue 3 is still unresolved; keep tracking it: https://gitee.com/ascend/modelzoo/issues/I28YYG

Resolving issue 3

After investigation (per the maintainers): the QueueDequeueMany output shape differed from TF because the shapes attribute of the upstream RandomShuffleQueue operator was not being propagated downstream.

The developer responsible for that operator was contacted to fix it.

The problem has been fixed; replace the file from the attachment at the following path (back up the original first):

      /home/HwHiAiUser/Ascend/ascend-toolkit/20.1.rc1/arm64-linux/opp/op_proto/built-in/libopsproto.so

With that, issue 3 is resolved, at the cost of yet another patch file.

Issue 4: error (tracking aoe's issue)

After applying the issue 3 patch, the error persisted. Since some time had passed and the message had nothing distinctive about it, I couldn't tell from the log whether my earlier bug had actually been fixed. Forgive my laziness; I did not compare the error messages carefully.

I found that contestant aoe had already filed an issue. After I applied the patch, our error messages were identical, so I kept tracking that issue and its resolution; count this as issue 4:

      https://gitee.com/ascend/modelzoo/issues/I2A7SC

The reply: following the "enabling mixed computing in sess.run mode" section at the link below, configure tf.train.shuffle_batch and tf.train.string_input_producer as not-sunk (run on host) and see whether the network runs:

      https://support.huaweicloud.com/mprtg-A800_9000_9010/atlasprtg_13_0033.html

So, use mixed computing. The documentation says that in mixed-computing mode iterations_per_loop must be 1, but I couldn't find that keyword anywhere in the code. Does that mean I can ignore its actual value? (I later learned through discussion that in mixed mode iterations_per_loop is already set to 1.)

Users can also mark specific operators as not-sunk with without_npu_compile_scope.

Following the instructions, I changed line 82 of the code:

# changed to not-sunk (run on host)
with npu_scope.without_npu_compile_scope():
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs)

aoe's problem was resolved and his issue closed, but mine was not, so I filed my own issue 4:

Issue 4: my own error report

Filed the issue:

      https://gitee.com/ascend/modelzoo/issues/I2AMHH

      Instructions for updating:

      Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.FixedLengthRecordDataset`.

      WARNING:tensorflow:From cifar10.py:305: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

      Instructions for updating:

      Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

      2020-12-23 22:22:55.165648: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

      2020-12-23 22:22:55.166302: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_5 success. [0 ms]

      2020-12-23 22:22:55.166352: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.

      2020-12-23 22:22:55.166546: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True

      2020-12-23 22:22:55.166574: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1

      2020-12-23 22:22:55.166804: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_9 begin.

      2020-12-23 22:22:55.166829: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True

      2020-12-23 22:22:55.166876: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1

      2020-12-23 22:22:55.167661: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0

      2020-12-23 22:22:55.167710: I tf_adapter/util/npu_ops_identifier.cc:67] [MIX] Parsing json from /home/HwHiAiUser/Ascend/ascend-toolkit/latest/arm64-linux/opp/framework/built-in/tensorflow/npu_supported_ops.json

      2020-12-23 22:22:55.169692: I tf_adapter/util/npu_ops_identifier.cc:69] 690 ops parsed

      2020-12-23 22:22:55.170185: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [2 ms]

      2020-12-23 22:22:55.176442: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1

      2020-12-23 22:22:55.176485: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp9_0 minGroupSize: 1

      2020-12-23 22:22:55.176643: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_9 markForPartition success.

      Traceback (most recent call last):

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call

      return fn(*args)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn

      target_list, run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun

      run_metadata)

      tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):

      File "cifar10.py", line 517, in

      extract_z(**config)

      File "cifar10.py", line 330, in extract_z

      sess.run(init_op)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

      run_metadata_ptr)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

      feed_dict_tensor, options, run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

      run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

      raise type(e)(node_def, op, message)

      tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

      2020-12-23 22:22:56.194723: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin

      2020-12-23 22:22:56.194890: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.

      2020-12-23 22:22:56.194990: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

Later it could not be reproduced, so the issue was closed. But today it reproduced again.

It turns out that on the Yitong system, python and python3.7 do not even point to the same interpreter:

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which python

      /usr/bin/python

model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which python3.7

      /usr/local/bin/python3.7

      model_user14@f2e974f6-0696-4b25-874d-3053d19ba4e2:~/jk/tf-vqvae$ which pip

      /usr/local/bin/pip

So the interpreter under /usr/local/bin should be used, i.e. python3.7, which matches the pip that installed the packages.
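A small pure-Python sketch of the sanity check that would have caught this earlier (the helper name on_path is hypothetical): print which interpreter is actually running and what the shell resolves each tool name to, so mismatched python/pip installations show up immediately.

```python
import shutil
import sys

def on_path(name):
    """Return the full path the shell would resolve `name` to, or None."""
    return shutil.which(name)

# The interpreter executing this script:
print("running interpreter:", sys.executable)

# What PATH resolution gives for each tool; any disagreement between the
# directory of `python` and the directory of `pip` signals trouble.
for tool in ("python", "python3.7", "pip"):
    print(f"{tool:10s} -> {on_path(tool)}")
```

If `which python` and `which pip` disagree, packages installed with `pip` land in a site-packages directory that `python` never looks at, which is exactly the silent failure mode described above.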

I checked the tf.train.batch settings, especially allow_smaller_final_batch. It appears in three places; the only one that takes effect had already been changed to False:

images, labels = tf.train.batch(
    [image, label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)

I changed the code away from aoe's approach, because his workaround does not meet the competition requirements.

I made the other two tf.train.batch calls not-sunk as well. In mixed computing, "not sinking" a statement just means wrapping it in:

      with npu_scope.without_npu_compile_scope():

After so many modifications the program was unrecognizable, so I re-tested the CPU code and found that even the cifar10 CPU code no longer passed. (That is what makes VQVAE hard: change one small thing and the error changes, and sometimes it breaks mysteriously even when nothing seems to have changed.)

It took a great deal more effort to get the CPU code working again. The CPU version is kept separately as cifar_base.py. Then, using it as a reference, I debugged the NPU version; in other words, with all NPU-related code disabled, the NPU script also runs.

Issue 5: shapes not aligned

      https://gitee.com/ascend/modelzoo/issues/I2AVI5

It is probably because the data batching does not drop the final partial batch:

images, labels = tf.train.batch(
    [image, label],
    batch_size=BATCH_SIZE,
    num_threads=1,
    capacity=BATCH_SIZE,
    allow_smaller_final_batch=False)

Adding the allow_smaller_final_batch=False line (the part shown in bold in the original post) fixes it.
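A pure-Python sketch (no TensorFlow; batch_indices is a hypothetical helper, not code from the model) of why a partial final batch breaks a fixed-shape graph: with, say, 50,000 CIFAR-10 training images and a batch size of 128, the last 80 images form a smaller batch unless the remainder is dropped, which is what allow_smaller_final_batch=False achieves.

```python
def batch_indices(num_examples, batch_size, drop_remainder=True):
    """Yield (start, end) index ranges, one per batch."""
    full = num_examples // batch_size
    for i in range(full):
        yield (i * batch_size, (i + 1) * batch_size)
    # Optionally emit the short final batch that breaks fixed shapes.
    if not drop_remainder and num_examples % batch_size:
        yield (full * batch_size, num_examples)

sizes = [end - start for start, end in batch_indices(50000, 128)]
print(set(sizes))   # every batch has the same size: {128}
print(50000 % 128)  # 80 examples would otherwise form a short final batch
```

With the remainder dropped, every tensor flowing through the graph keeps the same leading dimension, so shape checks on the NPU side never see a mismatched batch.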

Filed issue 5.1

      https://gitee.com/ascend/modelzoo/issues/I2B2US

The reply mentioned: whenever a DEVMM error appears, please run dmesg to capture the kernel log; it helps with diagnosis. Thanks!

      [ERROR] DEVMM(25538,python3.7):2020-12-27-12:52:10.065.876 [hardware/build/../dev_platform/devmm/devmm/devmm_svm.c:268][devmm_copy_ioctl 268] Ioctl(-1060090619) error! ret=-1, dst=0xfffed40af2a0, src=0x1008000bc000, size=112,

But this problem is hard to reproduce: it appears only occasionally on my system, and the developers found it difficult to reproduce as well.

Then, on the first working day after New Year's Day: new year, new luck. I tweaked the code slightly and, to my surprise, it ran through.

The data-loading part uses mixed computing with not-sunk operators; in principle the backbone code was not modified. It still failed on New Year's Day, but after a small change today it ran.

This issue is resolved; closed.

The notes above are short, but this issue cost an enormous amount of time, from the end of 2020 into the beginning of 2021 (two calendar years, counting both ends). The code was rewritten beyond recognition and the bugs changed shape daily; the joy of final success was as deep as the frustration along the way.

Issue 6 (unresolved): error

I did not file an issue for this error.

Error message:

      Instructions for updating:

      Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

      WARNING:tensorflow:From cn.py:346: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

      Instructions for updating:

      Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

      2021-01-04 16:05:37.573107: I tf_adapter/optimizers/get_attr_optimize_pass.cc:64] NpuAttrs job is localhost

      2021-01-04 16:05:37.574181: I tf_adapter/optimizers/get_attr_optimize_pass.cc:128] GetAttrOptimizePass_15 success. [0 ms]

      2021-01-04 16:05:37.574252: I tf_adapter/optimizers/mark_start_node_pass.cc:82] job is localhost Skip the optimizer : MarkStartNodePass.

      2021-01-04 16:05:37.574439: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:102] mix_compile_mode is True

      2021-01-04 16:05:37.574461: I tf_adapter/optimizers/mark_noneed_optimize_pass.cc:103] iterations_per_loop is 1

      2021-01-04 16:05:37.574658: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1763] OMPartition subgraph_29 begin.

      2021-01-04 16:05:37.574679: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1764] mix_compile_mode is True

      2021-01-04 16:05:37.574689: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1765] iterations_per_loop is 1

      2021-01-04 16:05:37.575437: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:354] FindNpuSupportCandidates enableDP:0, mix_compile_mode: 1, hasMakeIteratorOp:0, hasIteratorOp:0

      2021-01-04 16:05:37.575952: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:484] TFadapter find Npu support candidates cost: [0 ms]

      2021-01-04 16:05:37.582660: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:863] cluster Num is 1

      2021-01-04 16:05:37.582750: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:870] All nodes in graph: 382, max nodes count: 377 in subgraph: GeOp29_0 minGroupSize: 1

      2021-01-04 16:05:37.583336: I tf_adapter/optimizers/om_partition_subgraphs_pass.cc:1851] OMPartition subgraph_29 markForPartition success.

      Traceback (most recent call last):

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call

      return fn(*args)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn

      target_list, run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun

      run_metadata)


      tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):

      File "cn.py", line 558, in

      extract_z(**config)

      File "cn.py", line 371, in extract_z

      sess.run(init_op)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

      run_metadata_ptr)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

      feed_dict_tensor, options, run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

      run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

      raise type(e)(node_def, op, message)

      tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer_2/limit_epochs/epochs/Assign is not in white list

      2021-01-04 16:05:38.559795: I tf_adapter/util/ge_plugin.cc:56] [GePlugin] destroy constructor begin

      2021-01-04 16:05:38.560022: I tf_adapter/util/ge_plugin.cc:195] [GePlugin] Ge has already finalized.

      2021-01-04 16:05:38.560042: I tf_adapter/util/ge_plugin.cc:58] [GePlugin] destroy constructor end

Seeing this hint, should the code be changed?

      WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

      Instructions for updating:

      Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

      WARNING:tensorflow:From cn.py:347: batch (from tensorflow.python.training.input) is deprecated and will be removed in a future version.

      Instructions for updating:

      Queue-based input pipelines have been replaced by `tf.data`. Use `tf.data.Dataset.batch(batch_size)` (or `padded_batch(...)` if `dynamic_pad=True`).

The error message is now:

      During handling of the above exception, another exception occurred:

      Traceback (most recent call last):

      File "cn.py", line 559, in

      extract_z(**config)

      File "cn.py", line 372, in extract_z

      sess.run(init_op)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

      run_metadata_ptr)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run

      feed_dict_tensor, options, run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run

      run_metadata)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call

      raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Ref Tensors (e.g., Variables) output: input_producer/limit_epochs/epochs/Assign is not in white list

Searching around suggested it might be caused by incorrect initialization, so I tried adding:

      sess.graph.finalize()

Same error.

Main code:

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # train on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
sess = tf.Session(config=config)
# sess.graph.finalize()
sess.run(init_op)
print("="*1000, "run sess.run(init_op) OK!")
summary_writer = tf.summary.FileWriter(LOG_DIR, sess.graph)
# logging.warning("dch summary_writer")
summary_writer.add_summary(config_summary.eval(session=sess))
# logging.warning("dch summary_writer.add")

extract_z code:

with npu_scope.without_npu_compile_scope():
    images, labels = tf.train.batch(
        [image, label],
        batch_size=BATCH_SIZE,
        num_threads=1,
        capacity=BATCH_SIZE,
        allow_smaller_final_batch=False)
# <<<<<<<

# images = images.batch(batch_size, drop_remainder=True)

# >>>>>>> MODEL
with tf.variable_scope('net'):
    with tf.variable_scope('params') as params:
        pass
x_ph = tf.placeholder(tf.float32, [BATCH_SIZE, 32, 32, 3])
net = VQVAE(None, None, BETA, x_ph, K, D, _cifar10_arch, params, False)

init_op = tf.group(tf.global_variables_initializer(),
                   tf.local_variables_initializer())

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Run!
config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # train on the Ascend AI processor
custom_op.parameter_map["mix_compile_mode"].b = True  # test operator sinking
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # disable the remap pass
sess = tf.Session(config=config)
logger.warn('warn sess = tf.Session(config=config)')
# sess = tf.Session()
sess.graph.finalize()
sess.run(init_op)
logger.warn('warn sess.run(init_op)')

In the end, removing the num_epochs=1 argument from this call finally made it pass. (A plausible reason, which the developers did not confirm: in TF1, string_input_producer with num_epochs set creates a local epochs counter variable, and the Assign op on that ref variable is exactly the input_producer/limit_epochs/epochs/Assign op the error reports as not in the white list.)

# image, label = get_image(num_epochs=1)
image, label = get_image()

This may not be the definitive fix, but it will do for now.

Issue 7: error

In the train_prior part: config['TRAIN_NUM'] = 8  # errors out from 9 onward

Filed the issue: https://gitee.com/ascend/modelzoo/issues/I2BUME/

Error message:

      2021-01-04 20:30:06.952771: I tf_adapter/kernels/geop_npu.cc:573] [GEOP] RunGraphAsync callback, status:0, kernel_name:GeOp75_0[ 6456228us]

50%|██████████████████████████████████████████████                                               | 9/18 [01:36<01:36, 10.67s/it]

      Traceback (most recent call last):

      File "cifar10.py", line 531, in

      train_prior(config=config,**config)

      File "cifar10.py", line 476, in train_prior

      sess.run(sample_summary_op,feed_dict={sample_images:sampled_ims}),it)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run

      run_metadata_ptr)

      File "/usr/local/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1156, in _run

      (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))

      ValueError: Cannot feed value of shape (20, 32, 32, 3) for Tensor 'misc/Placeholder:0', which has shape '(1, 32, 32, 3)'

This problem is still unsolved; I have no idea where it comes from. It probably relates to how data is fed in, but for now I can only set config['TRAIN_NUM'] = 8 so that the whole program runs. This issue stays open.
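For reference, the failing check can be mimicked in plain Python (check_feed is a hypothetical stand-in, not TensorFlow source): a placeholder declared with batch dimension 1 rejects a batch of 20, while a None batch dimension would accept any batch size, which is the usual TF1 remedy for this kind of mismatch.

```python
def check_feed(value_shape, placeholder_shape):
    """Return True if value_shape is compatible with placeholder_shape,
    treating None as a wildcard dimension (as TF placeholders do)."""
    if len(value_shape) != len(placeholder_shape):
        return False
    return all(p is None or v == p
               for v, p in zip(value_shape, placeholder_shape))

# The shapes from the issue 7 traceback:
print(check_feed((20, 32, 32, 3), (1, 32, 32, 3)))     # False: batch dim fixed at 1
print(check_feed((20, 32, 32, 3), (None, 32, 32, 3)))  # True: wildcard batch dim
```

This suggests the mismatch sits between the sampler (producing 20 images at once) and a 'misc/Placeholder' declared for a single image; either the feed or the placeholder's leading dimension would need to change.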

Submitting the PR: the final sprint

After an arduous struggle, daylight at last: the whole model runs on the Ascend system and essentially meets the competition requirements. What remains is fine-tuning.

The reviewers' requested changes

1. The program explicitly marks tf.train.string_input_producer as not-sunk. Remove that and enable only mixed computing.

2. The VQ-VAE network problem: data preprocessing is controlled by the loop below, which terminates by throwing an exception once the data is exhausted; on Ascend hardware this currently causes a core dump. Please switch to a different control flow:

while not coord.should_stop():
    x, y = sess.run([images, labels])
    k = sess.run(net.k, feed_dict={x_ph: x})
    ks.append(k)
    ys.append(y)
    print('.', end='', flush=True)
except tf.errors.OutOfRangeError:

Final revisions to the VQVAE PR

1. Enable only mixed computing for the whole program and remove every individual not-sunk setting. (This is the intended usage: the system sinks everything it supports, and unsupported operators stay on host by default, with no manual configuration needed.)

2. Change the while loop into a for loop:

for step in tqdm(xrange(TRAIN_NUM), dynamic_ncols=True):
    x, y = sess.run([images, labels])
    k = sess.run(net.k, feed_dict={x_ph: x})
    ks.append(k)
    ys.append(y)

And set the number of loop steps:

      config['TRAIN_NUM'] = 24
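The shape of this change can be sketched in plain Python (run_one_step is a dummy stand-in for the two sess.run calls, not the real model code): the loop bound itself ends the iteration, so no OutOfRangeError is needed as a stop signal.

```python
def run_one_step(step):
    """Dummy stand-in for fetching a batch and the VQVAE codebook indices."""
    return step % 10, step  # pretend (k, y) pair

TRAIN_NUM = 24  # a fixed step count replaces the exception-driven stop
ks, ys = [], []
for step in range(TRAIN_NUM):
    k, y = run_one_step(step)
    ks.append(k)
    ys.append(y)

print(len(ks), len(ys))  # 24 24
```

Bounded loops like this are also friendlier to graph-mode accelerators in general, since termination no longer depends on catching a runtime exception from the input pipeline.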

After checking with Shaofang, the second part passes precisely because the get_image argument was removed; since the loop step count is now fixed, this should not affect the overall result.

Change: # image, label = get_image(num_epochs=1)

To: image, label = get_image()

Then the PR was submitted and finally accepted. Hooray! I was thrilled! The result is not what matters most; hitting problems and solving them along the way is. But without a result, this write-up would have no standing, the effort in between might have been wasted, and I would have learned less and remembered it less vividly.

Summary: porting the VQVAE TensorFlow model to Ascend

The competition went through registration, model selection, model migration, debugging, and PR submission, as described above. What a long story!

This model-migration competition was a great opportunity to learn. I knew nothing about TensorFlow before; after reading the code over and over, I now understand at least some of the flow of a TF program. I had previously only touched the Ascend system through ModelArts notebooks and training jobs; this was the first time I could freely install software and fully control the system, via the Yitong environment. During debugging I worked directly with Huawei's developers, whose prompt and accurate troubleshooting impressed me. I am full of confidence in the Ascend system and the MindSpore AI framework!

The Silver stage of the model competition is coming soon; get ready to sign up!

Tags: competition, Ascend
