Discussion:
Run Distributed TensorFlow on YARN
Sunil G
2018-11-07 06:05:14 UTC
Hi Robert

The Submarine project makes it easy to run distributed TensorFlow on top of
YARN. YARN-8220 <https://issues.apache.org/jira/browse/YARN-8220> was an
early attempt to do the same with some scripts, but Submarine avoids all
such custom scripts: you can simply run TensorFlow like a distributed shell
command line by using the Submarine jar. Please refer to the doc below for a
deep dive.
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7

Submarine will be released as part of the Hadoop 3.2.0 release, which will
be officially out very soon (in the coming weeks). You are free to use
Hadoop trunk to run it if you need it sooner.

For now you can refer to the Submarine docs in the Hadoop repo (trunk)
under hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/
or
https://github.com/apache/hadoop/tree/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown

Thanks
Sunil
Hi all,
I am wondering if there is any stable support to run distributed
TensorFlow atop YARN at the moment.
I found this blog post from Hortonworks. It seems this is possible
starting with YARN 3.1.0.
https://hortonworks.com/blog/distributed-tensorflow-assembly-hadoop-yarn/
https://issues.apache.org/jira/browse/YARN-8220
https://issues.apache.org/jira/browse/YARN-8135
These suggest using something called Submarine.
However, I could not find any proper documentation or instructions to use
any of these.
Can someone help me with this?
Otherwise, is there any better support for running any other machine
learning framework with YARN?
Thank you in advance,
- Robert
Jonathan Hung
2018-11-07 06:40:12 UTC
Hi Robert, I also encourage you to check out https://github.com/linkedin/TonY (TensorFlow on YARN) which is a platform built for this purpose.

Jonathan
Wangda Tan
2018-11-08 19:55:46 UTC
Forgot to add Xun in my last email.
Hi Robert,
Submarine in 3.2.0 only supports the Docker container runtime. In future
releases (maybe 3.2.1), we plan to add support for non-Docker containers.
To try Submarine, you first need to properly configure Docker on YARN.
You can check
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/InstallationScriptEN.md
for an installation guide on how to properly set up Docker containers on
multiple nodes. Submarine includes an interactive shell to help you with
the setup, so this should be straightforward. I have added Xun Liu, who is
the original author of the interactive installation shell.
Once you have Docker on YARN properly set up, you can follow
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/QuickStart.md
to run your first application.
Also, you can check the Submarine slides to better understand how it works.
See: https://www.dropbox.com/s/wuv19b3rt9k2kq6/submarine-v0.pptx?dl=0
If you have any questions, please don't hesitate to let us know.
Thanks,
Wangda
Thanks a lot for your reply.
Sunil, I followed
https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-submarine/src/site/markdown/RunningDistributedCifar10TFJobs.md
to run standalone TensorFlow using Submarine. I have installed Hadoop
3.3.0-SNAPSHOT.
However, when I run this command:

    yarn jar path/to/hadoop-yarn-applications-submarine-3.2.0-SNAPSHOT.jar \
      job run --name tf-job-001 --verbose --docker_image hadoopsubmarine/tf-1.8.0-gpu:0.0.1 \
      --input_path hdfs://default/dataset/cifar-10-data \
      --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/ \
      --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.1.0 \
      --num_workers 1 --worker_resources memory=8G,vcores=2,gpu=1 \
      --worker_launch_cmd "cd /test/models/tutorials/image/cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --train-steps=10000 --eval-batch-size=16 --train-batch-size=16 --num-gpus=2 --sync" \
      --tensorboard --tensorboard_docker_image wtan/tf-1.8.0-cpu:0.0.3

I get the following error:

    2018-11-07 21:48:55,831 INFO [main] client.AHSProxy (AHSProxy.java:createAHSProxy(42)) - Connecting to Application History server at /128.105.144.236:10200
    Exception in thread "main" java.lang.IllegalArgumentException: Unacceptable no of cpus specified, either zero or negative for component master (or at the global level)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateServiceResource(ServiceApiUtil.java:457)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateComponent(ServiceApiUtil.java:306)
        at org.apache.hadoop.yarn.service.utils.ServiceApiUtil.validateAndResolveService(ServiceApiUtil.java:237)
        at org.apache.hadoop.yarn.service.client.ServiceClient.actionCreate(ServiceClient.java:496)
        at org.apache.hadoop.yarn.submarine.runtimes.yarnservice.YarnServiceJobSubmitter.submitJob(YarnServiceJobSubmitter.java:542)
        at org.apache.hadoop.yarn.submarine.client.cli.RunJobCli.run(RunJobCli.java:231)
        at org.apache.hadoop.yarn.submarine.client.cli.Cli.main(Cli.java:94)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:236)

It seems that I have not configured the corresponding resources for a
master component somewhere. However, I have a hard time understanding where
and how. I also looked at
https://docs.google.com/document/d/199J4pB3blqgV9SCNvBbTqkEoQdjoyGMjESV4MktCo0k/edit#heading=h.vkxp9edl11m7
and it has a --master_resources flag. However, this is not available in
3.3.0.
Could you please advise how to proceed with this?
Thank you,
- Robert
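
For readers hitting the same exception, here is a minimal sketch of the kind of check the message describes: a component's resource string is parsed, and the job is rejected when its CPU count is missing, zero, or negative. This is an illustration only, not Hadoop's actual ServiceApiUtil code. The function names parse_resource_spec and check_cpus are hypothetical, and the only thing taken from the thread is the "memory=8G,vcores=2,gpu=1" string format.

```python
# Hypothetical illustration; not the Hadoop/YARN implementation.

def parse_resource_spec(spec):
    """Parse a 'key=value,key=value' resource string into a dict of strings."""
    resources = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        key, sep, value = part.partition("=")
        if not sep or not key.strip() or not value.strip():
            raise ValueError(f"Malformed resource entry: {part!r}")
        resources[key.strip()] = value.strip()
    return resources


def check_cpus(component, spec):
    """Return the vcores count, or raise if it is missing, zero, or negative."""
    resources = parse_resource_spec(spec)
    # A missing 'vcores' entry is treated as 0, which fails the check below.
    vcores = int(resources.get("vcores", "0"))
    if vcores <= 0:
        raise ValueError(
            f"no positive vcores specified for component {component}"
        )
    return vcores


if __name__ == "__main__":
    # The worker spec from the command above passes the check.
    print(check_cpus("worker", "memory=8G,vcores=2,gpu=1"))
    # A spec without vcores is rejected.
    try:
        check_cpus("master", "memory=8G,gpu=1")
    except ValueError as err:
        print(err)
```

The sketch only shows that a component with no usable CPU value fails this kind of validation. Which component Submarine maps to "master", and which flag sets its resources in a given Hadoop version, is not answered in this thread.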